jbilcke-hf commited on
Commit
ee81688
·
verified ·
1 Parent(s): b109524

Upload repository for paper 2510.20206

Browse files
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ assets/overview.png filter=lfs diff=lfs merge=lfs -text
APP_INFO.md ADDED
@@ -0,0 +1,208 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # RAPO++ Gradio App Documentation
2
+
3
+ ## Overview
4
+
5
+ This Gradio app demonstrates **Stage 1 (RAPO)** of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.
6
+
7
+ ## What It Does
8
+
9
+ The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.
10
+
11
+ ## How It Works
12
+
13
+ ### Architecture
14
+
15
+ 1. **Knowledge Graph Construction**
16
+ - Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
17
+ - Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
18
+ - Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")
19
+
20
+ 2. **Retrieval Process**
21
+ - Input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
22
+ - Finds top-K most similar places via cosine similarity
23
+ - Samples connected actions and atmosphere descriptors from graph neighbors
24
+ - Filters modifiers by relevance to the input prompt
25
+
26
+ 3. **Prompt Augmentation**
27
+ - Combines original prompt with retrieved modifiers
28
+ - Structures the output to maintain coherence
29
+ - Returns optimized prompt suitable for T2V generation
30
+
31
+ ### Key Components
32
+
33
+ **app.py** (main application):
34
+ - `create_demo_graph()`: Builds a simplified knowledge graph with common T2V concepts
35
+ - `retrieve_and_augment_prompt()`: Core RAPO function decorated with @spaces.GPU
36
+ - Gradio interface with examples and detailed documentation
37
+
38
+ **requirements.txt**:
39
+ - gradio 5.49.1 (pinned for compatibility)
40
+ - sentence-transformers + sentencepiece for embeddings
41
+ - torch 2.5.1 for tensor operations
42
+ - networkx for graph operations
43
+ - huggingface_hub for model downloads
44
+
45
+ ## Model Downloads
46
+
47
+ The app automatically downloads the required model on first run:
48
+ - **all-MiniLM-L6-v2**: Sentence transformer for computing text embeddings (~80MB)
49
+
50
+ Downloaded to: `./ckpt/all-MiniLM-L6-v2/`
51
+
52
+ ## Usage
53
+
54
+ ### Basic Usage
55
+
56
+ 1. Enter a simple prompt (e.g., "A person walking")
57
+ 2. Click "Optimize Prompt"
58
+ 3. View the enhanced prompt with contextual details
59
+
60
+ ### Advanced Settings
61
+
62
+ - **Number of Places to Retrieve**: How many related places to search (1-5, default: 2)
63
+ - **Modifiers per Place**: How many modifiers to sample from each place (1-10, default: 5)
64
+
65
+ ### Example Prompts
66
+
67
+ Try these examples to see the optimization in action:
68
+ - "A person walking"
69
+ - "A car driving at night"
70
+ - "Someone cooking in a kitchen"
71
+ - "A group of people talking"
72
+ - "A bird flying"
73
+ - "Someone sitting and reading"
74
+
75
+ ## Technical Details
76
+
77
+ ### Graph Structure
78
+
79
+ **Places (central nodes):**
80
+ - forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake
81
+
82
+ **Edge Types:**
83
+ - Place → Verb/Action edges (e.g., "forest" → "walking through")
84
+ - Place → Atmosphere edges (e.g., "forest" → "dense trees")
85
+
86
+ **Retrieval Algorithm:**
87
+ 1. Encode input prompt: `prompt_emb = model.encode(prompt)`
88
+ 2. Compute similarities: `cosine_similarity(prompt_emb, place_embeddings)`
89
+ 3. Select top-K places by similarity score
90
+ 4. Sample neighbors from graph: `G.neighbors(place)`
91
+ 5. Deduplicate and rank modifiers
92
+
93
+ ### ZeroGPU Integration
94
+
95
+ The `retrieve_and_augment_prompt()` function is decorated with `@spaces.GPU` to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM). This enables:
96
+ - Fast embedding computations
97
+ - Efficient cosine similarity calculations
98
+ - Scalability to larger graphs and batch processing
99
+
100
+ ### Differences from Full RAPO
101
+
102
+ This demo implements a **simplified version** of Stage 1 RAPO:
103
+
104
+ **Included:**
105
+ ✅ Knowledge graph with place-verb-scene relations
106
+ ✅ Embedding-based retrieval via SentenceTransformer
107
+ ✅ Cosine similarity ranking
108
+ ✅ Basic prompt augmentation
109
+
110
+ **Not Included (requires additional models/data):**
111
+ ❌ Full relation graph from paper (requires ~GB of graph data)
112
+ ❌ LLM-based sentence refactoring (Mistral-7B)
113
+ ❌ Iterative merging with similarity thresholds
114
+ ❌ Instruction-based rewriting (Llama3.1)
115
+
116
+ **Why This Approach:**
117
+ - Full RAPO requires 7B+ LLM downloads (~15GB+)
118
+ - Full graph data requires downloading preprocessed datasets
119
+ - This demo focuses on the **core concept**: retrieval-augmented prompt optimization
120
+ - Users can understand the methodology without waiting for large downloads
121
+
122
+ ## Running the Full RAPO Pipeline
123
+
124
+ To run the complete Stage 1 RAPO from the paper:
125
+
126
+ ```bash
127
+ cd examples/Stage1_RAPO
128
+
129
+ # 1. Retrieve modifiers from graph
130
+ sh retrieve_modifiers.sh
131
+
132
+ # 2. Word augmentation
133
+ sh word_augment.sh
134
+
135
+ # 3. Sentence refactoring
136
+ sh refactoring.sh
137
+
138
+ # 4. Instruction-based rewriting
139
+ sh rewrite_via_instruction.sh
140
+ ```
141
+
142
+ **Requirements:**
143
+ - Download full relation graph data to `relation_graph/graph_data/`
144
+ - Download Mistral-7B-Instruct-v0.3 to `ckpt/`
145
+ - Download llama3_1_instruct_lora_rewrite to `ckpt/`
146
+
147
+ See README.md for full installation instructions.
148
+
149
+ ## Integration with RAPO++ Stages
150
+
151
+ This demo showcases **Stage 1 only**. The complete RAPO++ framework includes:
152
+
153
+ **Stage 1 (RAPO)** - *Demonstrated Here*
154
+ - Retrieval-augmented prompt optimization via knowledge graphs
155
+ - Offline refinement using curated data
156
+
157
+ **Stage 2 (SSPO)**
158
+ - Self-supervised prompt optimization
159
+ - Iterative refinement based on generated video feedback
160
+ - Physics-aware consistency checks
161
+ - VLM-based alignment scoring
162
+
163
+ **Stage 3 (Fine-tuning)**
164
+ - LLM fine-tuning on collected feedback from Stage 2
165
+ - Model-specific prompt refiners
166
+
167
+ ## Performance Notes
168
+
169
+ - First run: ~1-2 minutes (downloads model)
170
+ - Subsequent runs: <1 second per prompt
171
+ - GPU allocation: Automatic via ZeroGPU
172
+ - Memory usage: ~500MB (model + graph)
173
+
174
+ ## Troubleshooting
175
+
176
+ **"No module named 'sentencepiece'"**
177
+ - Ensure `sentencepiece==0.2.1` is in requirements.txt
178
+ - sentence-transformers requires sentencepiece for tokenization
179
+
180
+ **"CUDA has been initialized before importing spaces"**
181
+ - The app correctly imports `spaces` FIRST before torch
182
+ - If you modify the code, maintain this import order
183
+
184
+ **Model download fails**
185
+ - Check internet connection
186
+ - HuggingFace Hub may be temporarily unavailable
187
+ - Model will retry on next run (cached after successful download)
188
+
189
+ ## References
190
+
191
+ **Papers:**
192
+ - [RAPO (CVPR 2025)](https://arxiv.org/abs/2502.07516): The Devil is in the Prompts
193
+ - [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206): Cross-Stage Prompt Optimization
194
+
195
+ **Project Pages:**
196
+ - RAPO: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
197
+ - RAPO++: https://whynothaha.github.io/RAPO_plus_github/
198
+
199
+ **Code:**
200
+ - GitHub: https://github.com/Vchitect/RAPO
201
+
202
+ ## License
203
+
204
+ Please refer to the original repository for licensing information.
205
+
206
+ ---
207
+
208
+ **Created for HuggingFace Spaces deployment**
CLAUDE.md ADDED
@@ -0,0 +1,244 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # CLAUDE.md
2
+
3
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4
+
5
+ ## Project Overview
6
+
7
+ RAPO++ is a three-stage framework for text-to-video (T2V) generation prompt optimization. It combines:
8
+ - **Stage 1 (RAPO)**: Retrieval-Augmented Prompt Optimization using relation graphs
9
+ - **Stage 2 (SSPO)**: Self-Supervised Prompt Optimization with test-time iterative refinement
10
+ - **Stage 3**: LLM fine-tuning on collected feedback data
11
+
12
+ The system is model-agnostic and works with various T2V models (Wan2.1, Open-Sora-Plan, HunyuanVideo, etc.).
13
+
14
+ ## Environment Setup
15
+
16
+ ```bash
17
+ # Create and activate environment
18
+ conda create -n rapo_plus python=3.10
19
+ conda activate rapo_plus
20
+
21
+ # Install dependencies
22
+ pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
23
+ pip install -r requirement.txt
24
+ ```
25
+
26
+ ## Required Checkpoints
27
+
28
+ Download and place in `ckpt/` directory:
29
+
30
+ **Stage 1:**
31
+ - `all-MiniLM-L6-v2/` - Sentence transformer for embeddings
32
+ - `llama3_1_instruct_lora_rewrite/` - LLM for prompt rewriting
33
+ - `Mistral-7B-Instruct-v0.3/` - Alternative instruction-tuned LLM
34
+
35
+ **Stage 2 (example with Wan2.1):**
36
+ - `Wan2.1-T2V-1.3B-Diffusers/` - Base T2V model
37
+ - `Qwen2.5-7B-Instruct/` - Instruction-following LLM for prompt refinement
38
+ - `Qwen2.5-vl-7B-instruct/` - Vision-language model for video alignment assessment
39
+
40
+ Also place relation graph data in `relation_graph/graph_data/`.
41
+
42
+ ## Core Workflows
43
+
44
+ ### Stage 1: RAPO (Retrieval-Augmented Prompt Optimization)
45
+
46
+ **Location:** `examples/Stage1_RAPO/`
47
+
48
+ **Pipeline:**
49
+ 1. **Graph Construction** (`construct_graph.py`):
50
+ - Reads CSV with columns: `Input`, `verb_obj_word`, `scenario_word`, `place`
51
+ - Creates NetworkX graphs linking places to verbs and scenes
52
+ - Generates embeddings with SentenceTransformer
53
+ - Outputs: JSON dictionaries, GraphML files to `relation_graph/`
54
+
55
+ 2. **Modifier Retrieval** (`retrieve_modifiers.py`):
56
+ - Input: Test prompts from `data/test_prompts.txt`
57
+ - Encodes prompts and retrieves top-K related places via cosine similarity
58
+ - Samples connected verbs/scenes from graph neighbors
59
+ - Outputs: `output/retrieve_words/{filename}.txt` and `.csv`
60
+ - Run: `sh retrieve_modifiers.sh`
61
+
62
+ 3. **Word Augmentation** (`word_augment.py`):
63
+ - Filters retrieved modifiers by similarity threshold
64
+ - Merges modifiers interactively
65
+ - Run: `sh word_augment.sh`
66
+
67
+ 4. **Sentence Refactoring** (`refactoring.py`):
68
+ - Restructures prompts with augmented modifiers
69
+ - Run: `sh refactoring.sh`
70
+
71
+ 5. **Instruction-Based Rewriting** (`rewrite_via_instruction.py`):
72
+ - Uses LLM to refine prompts with natural language instructions
73
+ - Run: `sh rewrite_via_instruction.sh`
74
+
75
+ **Key Parameters:**
76
+ - `place_num`: Top-K places to retrieve (default: 3)
77
+ - `verb_num`, `topk_num`: Controls verb/scene sampling
78
+ - `SIMILARITY_THRESHOLD`: Filters modifiers in word_augment.py
79
+
80
+ ### Stage 2: SSPO (Self-Supervised Prompt Optimization)
81
+
82
+ **Location:** `examples/Stage2_SSPO/`
83
+
84
+ **Main Script:** `phyaware_wan2.1.py`
85
+
86
+ **Architecture:**
87
+ This script implements a closed-loop iterative optimization pipeline:
88
+
89
+ 1. **Video Generation** (`load_model()`, `generate_single_video()`):
90
+ - Uses WanPipeline to generate videos from prompts
91
+ - Configurable: height=480, width=832, num_frames=81, fps=15
92
+
93
+ 2. **Optical Flow Analysis** (`extract_optical_flow()`):
94
+ - Extracts motion statistics using cv2.calcOpticalFlowFarneback
95
+ - Samples frames at configurable intervals
96
+ - Returns sequence of (x, y) flow vectors
97
+
98
+ 3. **VLM Alignment Assessment** (`misalignment_assessment()`):
99
+ - Uses Qwen2.5-VL to evaluate video-prompt alignment
100
+ - Assesses objects, actions, scenes
101
+ - Returns textual alignment score (1-5 scale)
102
+
103
+ 4. **Physics Consistency Check + Prompt Refinement** (`evaluate_physical_consistency()`):
104
+ - **Phase 1**: LLM analyzes optical flow for physical plausibility (inertia, momentum, etc.)
105
+ - **Phase 2**: Fuses physics analysis + VLM alignment feedback
106
+ - Rewrites prompt to enforce physical rules and semantic alignment
107
+ - Uses Qwen2.5-7B-Instruct
108
+
109
+ 5. **Iterative Loop**:
110
+ - Generates video → Analyzes → Refines prompt → Generates again
111
+ - Default: 5 refinement iterations per prompt
112
+ - Logs to CSV: `results/examples_refined/refined_prompts.csv`
113
+
114
+ **Resume Capability:**
115
+ The script checks existing logs and videos to resume from last iteration, maintaining prompt chain consistency.
116
+
117
+ **Input Format:**
118
+ CSV with columns: `captions` (prompt), `phys_law` (physical rule to enforce)
119
+
120
+ **Key Configuration (lines 248-264):**
121
+ ```python
122
+ WAN_MODEL_ID = "../../ckpt/Wan2.1-T2V-1.3B-Diffusers"
123
+ INSTRUCT_LLM_PATH = "../../ckpt/Qwen2.5-7B-Instruct"
124
+ QWEN_VL_PATH = "../../ckpt/qwen2.5-vl-7B-instruct"
125
+ num_refine_iterations = 5
126
+ ```
127
+
128
+ ### Stage 3: LLM Fine-Tuning
129
+
130
+ Not provided in code; uses feedback data from Stage 2 to fine-tune model-specific prompt refiners.
131
+
132
+ ## Key Architectural Patterns
133
+
134
+ ### Graph-Based Retrieval (Stage 1)
135
+ - **Data Structure**: NetworkX graphs with place nodes as hubs
136
+ - **Retrieval**: Cosine similarity between prompt embeddings and place embeddings
137
+ - **Augmentation**: Graph neighbors provide contextually relevant modifiers
138
+ - **Caching**: Pre-computed embeddings stored in JSON for efficiency
139
+
140
+ ### Closed-Loop Optimization (Stage 2)
141
+ - **Multi-Modal Feedback**: Combines optical flow (physics) + VLM (semantics)
142
+ - **Iterative Refinement**: Each video informs next prompt
143
+ - **Logging**: CSV tracks full prompt evolution chain
144
+ - **Modularity**: Easy to swap T2V models, reward functions, or VLMs
145
+
146
+ ### Embedding Model Usage
147
+ - SentenceTransformer for text similarity (Stage 1)
148
+ - Pre-encode and cache all graph tokens to avoid redundant computation
149
+
150
+ ## Common Commands
151
+
152
+ **Stage 1 - Full Pipeline:**
153
+ ```bash
154
+ cd examples/Stage1_RAPO
155
+
156
+ # Build graph from scratch
157
+ python construct_graph.py
158
+
159
+ # Run full RAPO pipeline
160
+ sh retrieve_modifiers.sh
161
+ sh word_augment.sh
162
+ sh refactoring.sh
163
+ sh rewrite_via_instruction.sh
164
+ ```
165
+
166
+ **Stage 2 - SSPO:**
167
+ ```bash
168
+ cd examples/Stage2_SSPO
169
+ python phyaware_wan2.1.py
170
+ ```
171
+
172
+ ## File Dependencies
173
+
174
+ **Input Files:**
175
+ - `data/test_prompts.txt` - One prompt per line for Stage 1
176
+ - `examples/Stage2_SSPO/examples.csv` - Prompts + physical rules for Stage 2
177
+ - `relation_graph/graph_data/*.json` - Pre-built graph data
178
+ - `relation_graph/graph_data/*.graphml` - Graph structure
179
+
180
+ **Output Structure:**
181
+ - `examples/Stage1_RAPO/output/retrieve_words/` - Retrieved modifiers
182
+ - `examples/Stage1_RAPO/output/refactor/` - Augmented prompts
183
+ - `examples/Stage2_SSPO/results/examples_refined/` - Videos + logs
184
+
185
+ ## Critical Implementation Details
186
+
187
+ ### Stage 1 Graph Construction
188
+ - Place tokens serve as central nodes linking verbs and scenes
189
+ - Edge weights implicitly represent co-occurrence frequency
190
+ - Embedding dimension from SentenceTransformer: 384 (all-MiniLM-L6-v2)
191
+
192
+ ### Stage 2 Physics Analysis
193
+ The `evaluate_physical_consistency()` function uses a two-phase LLM prompting strategy:
194
+ 1. First call: Analyze optical flow for physics violations
195
+ 2. Second call: Synthesize physics + VLM feedback into refined prompt
196
+
197
+ The prompt rewriting instruction explicitly constrains:
198
+ - Motion continuity and force consistency
199
+ - Object states and timings
200
+ - Camera motion if needed
201
+ - Output limited to <120 words
202
+
203
+ ### Optical Flow Extraction
204
+ - Uses Farneback algorithm (dense optical flow)
205
+ - Samples frames at 0.5-second intervals by default
206
+ - Returns mean (x, y) flow per frame pair
207
+ - Sudden reversals or inconsistent magnitudes indicate physics violations
208
+
209
+ ## Model Swapping
210
+
211
+ **To use a different T2V model in Stage 2:**
212
+ 1. Update pipeline loading in `load_model()` function
213
+ 2. Adjust generation parameters (height, width, num_frames)
214
+ 3. Ensure model outputs diffusers-compatible format
215
+ 4. Update checkpoint path constants (lines 249-251)
216
+
217
+ **To use a different VLM:**
218
+ - Replace `Qwen2_5_VLForConditionalGeneration` with alternative
219
+ - Adjust processor and prompt template in `misalignment_assessment()`
220
+
221
+ **To use a different LLM for refinement:**
222
+ - Update `INSTRUCT_LLM_PATH` and ensure transformers compatibility
223
+ - Modify system/user message format if needed
224
+
225
+ ## Troubleshooting
226
+
227
+ **Graph loading errors:**
228
+ - Ensure all JSON files exist in `relation_graph/graph_data/`
229
+ - Check GraphML files are valid NetworkX format
230
+
231
+ **CUDA OOM:**
232
+ - Stage 2 loads 3 large models simultaneously (T2V, VLM, LLM)
233
+ - Reduce batch size or use smaller models
234
+ - Consider offloading models between steps
235
+
236
+ **Syntax error in phyaware_wan2.1.py line 251:**
237
+ - Missing opening quote: `QWEN_VL_PATH = ../../ckpt//qwen2.5-vl-7B-instruct"`
238
+ - Should be: `QWEN_VL_PATH = "../../ckpt/qwen2.5-vl-7B-instruct"`
239
+
240
+ ## Paper References
241
+
242
+ - **RAPO**: "The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation" (CVPR 2025)
243
+ - **RAPO++**: arXiv:2510.20206
244
+ - Project pages and models available on HuggingFace
README.md CHANGED
@@ -1,12 +1,56 @@
1
  ---
2
- title: SNIPED Rapo
3
- emoji: 🏃
4
  colorFrom: yellow
5
- colorTo: green
6
  sdk: gradio
7
  sdk_version: 5.49.1
8
  app_file: app.py
9
  pinned: false
 
 
 
 
 
 
 
 
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ title: "RAPO"
3
+ emoji: 🤖
4
  colorFrom: yellow
5
+ colorTo: blue
6
  sdk: gradio
7
  sdk_version: 5.49.1
8
  app_file: app.py
9
  pinned: false
10
+ short_description: "Manual Entry: https://huggingface.co/papers/2510.20206"
11
+ hardware: zerogpu
12
+ tags:
13
+ - research
14
+ - paper
15
+ - code
16
+ - cheatcode
17
+ license: mit
18
  ---
19
 
20
+ # RAPO
21
+
22
+ **Automated upload by CheatCode** 🚀
23
+
24
+ ## 📄 Paper Information
25
+
26
+ - **Paper ID**: 2510.20206
27
+ - **Title**: Manual Entry: https://huggingface.co/papers/2510.20206
28
+ - **Original Repository**: [https://github.com/Vchitect/RAPO](https://github.com/Vchitect/RAPO)
29
+
30
+ ## 🛠️ Repository Information
31
+
32
+ - **Languages**: Python, Shell
33
+ - **Gradio App**: ✅ Generated by CheatCode
34
+
35
+ ## 🤖 About CheatCode
36
+
37
+ This Space was automatically created by [CheatCode](https://github.com/jbilcke-hf/CheatCode),
38
+ an AI-powered tool that:
39
+
40
+ 1. Discovers research papers from HuggingFace
41
+ 2. Extracts and analyzes linked repositories
42
+ 3. Generates Gradio demo applications
43
+ 4. Uploads everything to HuggingFace Spaces
44
+
45
+ ## 📝 Usage
46
+
47
+ This Space includes a Gradio app that was automatically generated from the repository code.
48
+
49
+ ## ⚠️ Disclaimer
50
+
51
+ This is an automated upload. The code comes from the original repository and may require
52
+ additional configuration or dependencies to run properly.
53
+
54
+ ## 📜 License
55
+
56
+ Please refer to the original repository for licensing information: https://github.com/Vchitect/RAPO
README_original.md ADDED
@@ -0,0 +1,157 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: RAPO++ Text-to-Video Prompt Optimization
3
+ emoji: 🎬
4
+ colorFrom: purple
5
+ colorTo: blue
6
+ sdk: gradio
7
+ sdk_version: 5.49.1
8
+ app_file: app.py
9
+ pinned: false
10
+ short_description: A three-stage framework for optimizing text-to-video generation prompts via retrieval, self-supervised refinement, and LLM fine-tuning
11
+ hardware: zerogpu
12
+ ---
13
+
14
+ # RAPO++: Prompting Test-Time Scaling for Text-to-Video Generation
15
+ <p align="center">
16
+ <a href="https://arxiv.org/pdf/2504.11739" target="_blank"><img src="https://img.shields.io/badge/Paper-RAPO-red"></a>
17
+ <a href='https://whynothaha.github.io/Prompt_optimizer/RAPO.html' target="_blank"><img src='https://img.shields.io/badge/ProjectPage-RAPO-blue'></a>
18
+ <a href="https://arxiv.org/abs/2510.20206" target="_blank"><img src="https://img.shields.io/badge/Paper-RAPO++-red"></a>
19
+ <a href='https://whynothaha.github.io/RAPO_plus_github/' target="_blank"><img src='https://img.shields.io/badge/ProjectPage-RAPO++-blue'></a>
20
+ <a href="https://huggingface.co/papers/2510.20206" target="_blank"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Daily Papers-red"></a>
21
+ </p >
22
+
23
+ <p align="center">
24
+ <strong><big>
25
+ If you find our work useful, please consider giving us a star🌟</big></strong>
26
+ </p>
27
+
28
+
29
+ ## 📚 AutoPage
30
+ Our website is automatically generated using our [**AutoPage**](https://mqleet.github.io/AutoPage_ProjectPage/), a multi-agent system we highly recommend for effortless academic page creation.
31
+
32
+ ## 📋 Table of Contents
33
+
34
+ This is the official implementation for
35
+ - [RAPO] [The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation, **CVPR 2025**](https://arxiv.org/abs/2502.07516)
36
+ - [RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling, arXiv:2510.20206](https://arxiv.org/abs/2510.20206)
37
+ - [🔎 Overview](#-overview)
38
+ - [🤗 Checkpoint](#-checkpoint)
39
+ - [🛠️ Installation](#-installation)
40
+ - [🚀 Quick Start](#-quick-start)
41
+ - [📐 Evaluation](#-evaluation)
42
+
43
+
44
+
45
+
46
+ ## 🔎 Overview
47
+ RAPO++ is a three-stage framework that enhances text-to-video generation without modifying model architectures. It unifies data-aligned prompt refinement (RAPO), test-time iterative optimization (SSPO), and LLM fine-tuning**, enabling more coherent, compositional, and physically realistic video synthesis. Tested on five state-of-the-art models and benchmarks, RAPO++ consistently improves semantic alignment, temporal stability, and visual fidelity, setting a new standard for prompt optimization in T2V generation.
48
+
49
+ The core contribution of RAPO++ lies in SSPO, a model-agnostic, closed-loop mechanism that iteratively refines prompts through feedback from generated videos. When using RAPO++, users can replace RAPO with their model’s built-in prompt refiner as initialization. The feedback data collected during SSPO can then be used to fine-tune the refiner itself, further enhancing model-specific prompt optimization.
50
+ ![Overview](assets/overview.png)
51
+
52
+
53
+
54
+
55
+
56
+
57
+ ## 🛠️ Installation
58
+ 1. Clone the Repository
59
+ ```
60
+ git clone https://github.com/Vchitect/RAPO.git
61
+ cd RAPO
62
+ ```
63
+ 2. Set up Environment
64
+ ```
65
+ conda create -n rapo_plus python=3.10
66
+ conda activate RAPO
67
+ pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
68
+ pip install -r requirements.txt
69
+ ```
70
+
71
+ ## 🤗 Checkpoint
72
+ ### Stage 1 RAPO
73
+ Download the required model weights [RAPO](https://huggingface.co/bingjie/RAPO/tree/main), relation graph and pretrained LLM (e.g. , [
74
+ Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3/tree/main) )and place them in the `ckpt/` and `relation_graph/` directory.
75
+ ```
76
+ ckpt/
77
+ │── all-MiniLM-L6-v2/
78
+ │── llama3_1_instruct_lora_rewrite/
79
+ │── Mistral-7B-Instruct-v0.3/
80
+ relation_graph/
81
+ │── graph_data/
82
+ ```
83
+ ### Stage 2 SSPO
84
+ We take Wan2.1-T2V as the base model to illustrate the process of SSPO. Download the required model weights [Wan2.1-T2V](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/tree/main), [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/tree/main), [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct/tree/main]), and place them in the `ckpt/` directory.
85
+
86
+ ```
87
+ ckpt/
88
+ │── Wan2.1-T2V-1.3B-Diffusers/
89
+ │── Qwen2.5-7B-Instruct/
90
+ │── Qwen2.5-vl-7B-instruct/
91
+ ```
92
+
93
+
94
+
95
+ ## 🚀 Quick Start
96
+ ### Stage 1 RAPO
97
+ ```
98
+ cd ./examples/Stage1_RAPO/
99
+ ```
100
+ 0. We provide the codes to compose the graph data. We provide two examples of inputs to compose the graph data (`./dataset/graph_test1.csv` and `./dataset/graph_test2.csv`). You can build a relation graph from scratch based on the constructed data:
101
+ ```
102
+ python construct_graph.py
103
+ ```
104
+ or you can add data based on the already constructed relation graph:
105
+ ```
106
+ python add_to_graph.py
107
+ ```
108
+ 1. Retrieve related modifiers from relation graph. You can adjust the hyperparameters in `retrieve_modifiers.py` to modify the number of retrieval modifiers.
109
+ ```
110
+ sh retrieve_modifiers.sh
111
+ ```
112
+ 2. Word augmentation and sentence refactoring.
113
+ ```
114
+ sh word_augment.sh
115
+ sh refactoring.sh
116
+ ```
117
+ 3. Rewrite via instruction.
118
+ ```
119
+ sh rewrite_via_instruction.sh
120
+ ```
121
+ ### Stage 2 SSPO
122
+ ```
123
+ cd ./examples/Stage2_SSPO/
124
+ ```
125
+ We take **physical-aware video generation** based on Wan2.1 as an example. We provide an examples.csv file in this directory, which contains some test prompts and the physical rules that T2V generation needs to comply with.
126
+ For quickly start, the script generates and refines videos iteratively by combining Wan 2.1 T2V generation, Qwen2.5-VL alignment scoring, and physics-based prompt rewriting to enhance realism and consistency. You can modify the script to change the base model, include custom reward functions and historical-prompt backtracking for task-specific adaptation.
127
+ ```
128
+ python phyaware_wan2.1.py
129
+ ```
130
+
131
+ ### Stage 3 LLM finetuning
132
+ For LLM fine-tuning, the process depends on the selected T2V base models and further refines the Stage 2 naive-optimized prompts.
133
+ Examples include [Open-Sora-Plan](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0), [Wan2.1](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B), [HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo-PromptRewrite), and et.al.
134
+
135
+
136
+
137
+ ## ✒️ Citation
138
+ If you find our work helpful for your research, please consider giving a citation 📝
139
+
140
+ ```
141
+ @article{gao2025rapopp,
142
+ title = {RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling},
143
+ author = {Gao, Bingjie and Ma, Qianli and Wu, Xiaoxue and Yang, Shuai and Lan, Guanzhou and Zhao, Haonan and Chen, Jiaxuan and Liu, Qingyang and Qiao, Yu and Chen, Xinyuan and Wang, Yaohui and Niu, Li},
144
+ journal = {arXiv preprint arXiv:2510.20206},
145
+ year = {2025}
146
+ }
147
+ ```
148
+ ```
149
+ @InProceedings{Gao_2025_CVPR,
150
+ author = {Gao, Bingjie and Gao, Xinyu and Wu, Xiaoxue and Zhou, Yujie and Qiao, Yu and Niu, Li and Chen, Xinyuan and Wang, Yaohui},
151
+ title = {The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation},
152
+ booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
153
+ month = {June},
154
+ year = {2025},
155
+ pages = {3173-3183}
156
+ }
157
+ ```
app.py ADDED
@@ -0,0 +1,349 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ RAPO++ Text-to-Video Prompt Optimization Demo
3
+
4
+ This demo showcases Stage 1 (RAPO): Retrieval-Augmented Prompt Optimization
5
+ It demonstrates how simple prompts can be enriched with contextually relevant modifiers
6
+ retrieved from a knowledge graph for better text-to-video generation.
7
+ """
8
+
9
+ # CRITICAL: Import spaces FIRST before any CUDA-related packages
10
+ import spaces
11
+
12
+ import gradio as gr
13
+ import torch
14
+ from sentence_transformers import SentenceTransformer
15
+ from torch.nn.functional import cosine_similarity
16
+ import networkx as nx
17
+ import json
18
+ import os
19
+ import random
20
+ from huggingface_hub import snapshot_download, hf_hub_download
21
+
22
+ # =============================================================================
23
+ # Model and Data Setup (runs once at startup)
24
+ # =============================================================================
25
+
26
+ print("=" * 60)
27
+ print("Setting up RAPO++ Demo...")
28
+ print("=" * 60)
29
+
30
+ # Create necessary directories
31
+ os.makedirs("./ckpt", exist_ok=True)
32
+ os.makedirs("./relation_graph/graph_data", exist_ok=True)
33
+
34
+ # Download SentenceTransformer model for embeddings
35
+ SENTENCE_TRANSFORMER_PATH = "./ckpt/all-MiniLM-L6-v2"
36
+ if not os.path.exists(SENTENCE_TRANSFORMER_PATH):
37
+ print("Downloading SentenceTransformer model...")
38
+ snapshot_download(
39
+ repo_id="sentence-transformers/all-MiniLM-L6-v2",
40
+ local_dir=SENTENCE_TRANSFORMER_PATH,
41
+ local_dir_use_symlinks=False
42
+ )
43
+ print("✓ SentenceTransformer downloaded")
44
+ else:
45
+ print("✓ SentenceTransformer already cached")
46
+
47
+ # Load SentenceTransformer model
48
+ print("Loading SentenceTransformer model...")
49
+ embedding_model = SentenceTransformer(SENTENCE_TRANSFORMER_PATH)
50
+ print("✓ Model loaded")
51
+
52
+ # =============================================================================
53
+ # Simple Demo Graph Creation (since full graph data requires large download)
54
+ # =============================================================================
55
+
56
+ def create_demo_graph():
57
+ """Create a simplified demo graph with common T2V generation concepts"""
58
+
59
+ # Create sample place-verb and place-scene graphs
60
+ G_place_verb = nx.Graph()
61
+ G_place_scene = nx.Graph()
62
+
63
+ # Define places (central nodes)
64
+ places = [
65
+ "forest", "beach", "city street", "mountain", "room", "park",
66
+ "studio", "kitchen", "bridge", "parking lot", "desert", "lake"
67
+ ]
68
+
69
+ # Define verbs/actions for each place
70
+ place_verbs = {
71
+ "forest": ["walking through", "hiking in", "exploring", "camping in", "running through"],
72
+ "beach": ["walking on", "swimming at", "surfing at", "relaxing on", "playing on"],
73
+ "city street": ["walking down", "driving through", "running along", "biking through"],
74
+ "mountain": ["climbing", "hiking up", "descending", "exploring", "camping on"],
75
+ "room": ["sitting in", "working in", "relaxing in", "reading in", "sleeping in"],
76
+ "park": ["walking in", "playing in", "jogging through", "sitting in", "picnicking in"],
77
+ "studio": ["working in", "dancing in", "recording in", "practicing in"],
78
+ "kitchen": ["cooking in", "preparing food in", "baking in", "cleaning"],
79
+ "bridge": ["walking across", "driving across", "standing on", "running across"],
80
+ "parking lot": ["standing in", "walking through", "driving in", "parking in"],
81
+ "desert": ["walking through", "driving through", "camping in", "exploring"],
82
+ "lake": ["swimming in", "boating on", "fishing at", "relaxing by"]
83
+ }
84
+
85
+ # Define scenarios/atmospheres for each place
86
+ place_scenes = {
87
+ "forest": ["dense trees", "peaceful atmosphere", "natural setting", "quiet surroundings"],
88
+ "beach": ["ocean waves", "sunny day", "sandy shore", "coastal view"],
89
+ "city street": ["busy traffic", "urban environment", "city lights", "crowded sidewalk"],
90
+ "mountain": ["scenic view", "high altitude", "rocky terrain", "mountain peak"],
91
+ "room": ["indoor setting", "comfortable space", "quiet environment", "cozy atmosphere"],
92
+ "park": ["green grass", "open space", "trees around", "peaceful setting"],
93
+ "studio": ["professional lighting", "indoor space", "creative environment"],
94
+ "kitchen": ["modern appliances", "cooking area", "indoor setting", "bright lighting"],
95
+ "bridge": ["elevated view", "water below", "connecting path", "architectural structure"],
96
+ "parking lot": ["outdoor area", "vehicles around", "paved surface", "open space"],
97
+ "desert": ["sandy terrain", "hot climate", "barren landscape", "vast expanse"],
98
+ "lake": ["calm water", "natural scenery", "peaceful setting", "reflection on water"]
99
+ }
100
+
101
+ # Build graphs
102
+ for place in places:
103
+ # Add place-verb connections
104
+ for verb in place_verbs.get(place, []):
105
+ G_place_verb.add_edge(place, verb)
106
+
107
+ # Add place-scene connections
108
+ for scene in place_scenes.get(place, []):
109
+ G_place_scene.add_edge(place, scene)
110
+
111
+ # Create embeddings for all places
112
+ place_embeddings = embedding_model.encode(places)
113
+
114
+ # Create lookup dictionaries
115
+ place_to_idx = {place: idx for idx, place in enumerate(places)}
116
+ idx_to_place = {idx: place for place, idx in place_to_idx.items()}
117
+
118
+ return G_place_verb, G_place_scene, place_embeddings, place_to_idx, idx_to_place
119
+
120
+ # Initialize demo graph
121
+ print("Creating demo knowledge graph...")
122
+ G_place_verb, G_place_scene, place_embeddings, place_to_idx, idx_to_place = create_demo_graph()
123
+ print("✓ Demo graph created")
124
+ print("=" * 60)
125
+ print("✓ Setup complete!")
126
+ print("=" * 60)
127
+
128
+ # =============================================================================
129
+ # Core RAPO Functions
130
+ # =============================================================================
131
+
132
+ @spaces.GPU
133
+ def retrieve_and_augment_prompt(prompt: str, place_num: int = 2, modifier_num: int = 5) -> tuple:
134
+ """
135
+ Main RAPO function: Retrieves relevant modifiers from the graph and augments the prompt.
136
+
137
+ Args:
138
+ prompt: Input text-to-video generation prompt
139
+ place_num: Number of top places to retrieve
140
+ modifier_num: Number of modifiers to sample per place
141
+
142
+ Returns:
143
+ Tuple of (augmented_prompt, retrieved_info, places_found)
144
+ """
145
+ # Encode input prompt
146
+ prompt_embedding = embedding_model.encode(prompt)
147
+
148
+ # Compute similarity with all places
149
+ similarities = cosine_similarity(
150
+ torch.tensor(prompt_embedding).unsqueeze(0),
151
+ torch.tensor(place_embeddings)
152
+ )
153
+
154
+ # Get top-K most similar places
155
+ top_indices = torch.topk(similarities, min(place_num, len(place_to_idx))).indices
156
+
157
+ # Retrieve modifiers from graph
158
+ retrieved_verbs = []
159
+ retrieved_scenes = []
160
+ places_found = []
161
+
162
+ for idx in top_indices.numpy().tolist():
163
+ place = idx_to_place[idx]
164
+ places_found.append(place)
165
+
166
+ # Get verb neighbors
167
+ verb_neighbors = list(G_place_verb.neighbors(place))
168
+ verb_samples = random.sample(verb_neighbors, min(modifier_num, len(verb_neighbors)))
169
+ retrieved_verbs.extend(verb_samples)
170
+
171
+ # Get scene neighbors
172
+ scene_neighbors = list(G_place_scene.neighbors(place))
173
+ scene_samples = random.sample(scene_neighbors, min(modifier_num, len(scene_neighbors)))
174
+ retrieved_scenes.extend(scene_samples)
175
+
176
+ # Remove duplicates while preserving order
177
+ retrieved_verbs = list(dict.fromkeys(retrieved_verbs))
178
+ retrieved_scenes = list(dict.fromkeys(retrieved_scenes))
179
+
180
+ # Create augmented prompt (simple version - just add contextual details)
181
+ augmented_parts = [prompt.strip()]
182
+
183
+ # Add most relevant modifiers
184
+ if retrieved_verbs:
185
+ augmented_parts.append(f"The scene shows {retrieved_verbs[0]}")
186
+ if retrieved_scenes:
187
+ augmented_parts.append(f"with {retrieved_scenes[0]}")
188
+
189
+ augmented_prompt = ", ".join(augmented_parts) + "."
190
+
191
+ # Format retrieved info for display
192
+ retrieved_info = {
193
+ "Places": places_found,
194
+ "Actions": retrieved_verbs[:5],
195
+ "Atmosphere": retrieved_scenes[:5]
196
+ }
197
+
198
+ return augmented_prompt, retrieved_info, places_found
199
+
200
+ # =============================================================================
201
+ # Gradio Interface
202
+ # =============================================================================
203
+
204
+ def process_prompt(prompt, place_num, modifier_num):
205
+ """Process prompt and return results for Gradio"""
206
+ if not prompt.strip():
207
+ return "Please enter a prompt.", {}, []
208
+
209
+ try:
210
+ augmented_prompt, retrieved_info, places = retrieve_and_augment_prompt(
211
+ prompt, place_num, modifier_num
212
+ )
213
+
214
+ # Format retrieved info for display
215
+ info_text = "**Retrieved Modifiers:**\n\n"
216
+ info_text += f"**📍 Top Places:** {', '.join(places)}\n\n"
217
+ info_text += f"**🎬 Actions:** {', '.join(retrieved_info['Actions'])}\n\n"
218
+ info_text += f"**🌅 Atmosphere:** {', '.join(retrieved_info['Atmosphere'])}\n\n"
219
+
220
+ return augmented_prompt, info_text
221
+ except Exception as e:
222
+ return f"Error: {str(e)}", ""
223
+
224
+ # Create Gradio interface
225
+ with gr.Blocks(
226
+ theme=gr.themes.Soft(
227
+ primary_hue="purple",
228
+ secondary_hue="blue"
229
+ ),
230
+ title="RAPO++ Text-to-Video Prompt Optimization"
231
+ ) as demo:
232
+
233
+ gr.Markdown("""
234
+ # 🎬 RAPO++ Text-to-Video Prompt Optimization
235
+
236
+ This demo showcases **Stage 1 (RAPO)**: Retrieval-Augmented Prompt Optimization using knowledge graphs.
237
+
238
+ **How it works:**
239
+ 1. Enter a simple text-to-video prompt
240
+ 2. The system retrieves contextually relevant modifiers from a knowledge graph
241
+ 3. Your prompt is enhanced with specific actions and atmospheric details
242
+ 4. Use the optimized prompt for better T2V generation results!
243
+
244
+ **Example prompts to try:**
245
+ - "A person walking"
246
+ - "A car driving"
247
+ - "Someone cooking"
248
+ - "A group of people talking"
249
+
250
+ Based on the paper: [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206)
251
+ """)
252
+
253
+ with gr.Row():
254
+ with gr.Column(scale=1):
255
+ gr.Markdown("### Input")
256
+
257
+ input_prompt = gr.Textbox(
258
+ label="Original Prompt",
259
+ placeholder="Enter your text-to-video prompt (e.g., 'A person walking')",
260
+ lines=3
261
+ )
262
+
263
+ with gr.Accordion("Advanced Settings", open=False):
264
+ place_num = gr.Slider(
265
+ minimum=1,
266
+ maximum=5,
267
+ value=2,
268
+ step=1,
269
+ label="Number of Places to Retrieve",
270
+ info="How many related places to search in the knowledge graph"
271
+ )
272
+
273
+ modifier_num = gr.Slider(
274
+ minimum=1,
275
+ maximum=10,
276
+ value=5,
277
+ step=1,
278
+ label="Modifiers per Place",
279
+ info="How many modifiers to sample from each place"
280
+ )
281
+
282
+ process_btn = gr.Button("✨ Optimize Prompt", variant="primary", size="lg")
283
+
284
+ with gr.Column(scale=1):
285
+ gr.Markdown("### Results")
286
+
287
+ output_prompt = gr.Textbox(
288
+ label="Optimized Prompt",
289
+ lines=5,
290
+ show_copy_button=True
291
+ )
292
+
293
+ retrieved_info = gr.Markdown(
294
+ label="Retrieved Information"
295
+ )
296
+
297
+ # Example prompts
298
+ gr.Examples(
299
+ examples=[
300
+ ["A person walking", 2, 5],
301
+ ["A car driving at night", 2, 5],
302
+ ["Someone cooking in a kitchen", 2, 5],
303
+ ["A group of people talking", 2, 5],
304
+ ["A bird flying", 2, 5],
305
+ ["Someone sitting and reading", 2, 5],
306
+ ],
307
+ inputs=[input_prompt, place_num, modifier_num],
308
+ outputs=[output_prompt, retrieved_info],
309
+ fn=process_prompt,
310
+ cache_examples=False
311
+ )
312
+
313
+ gr.Markdown("""
314
+ ---
315
+ ### About RAPO++
316
+
317
+ RAPO++ is a three-stage framework for text-to-video generation prompt optimization:
318
+
319
+ - **Stage 1 (RAPO)**: Retrieval-Augmented Prompt Optimization using relation graphs *(demonstrated here)*
320
+ - **Stage 2 (SSPO)**: Self-Supervised Prompt Optimization with test-time iterative refinement
321
+ - **Stage 3**: LLM fine-tuning on collected feedback data
322
+
323
+ The system is model-agnostic and works with various T2V models (Wan2.1, Open-Sora-Plan, HunyuanVideo, etc.).
324
+
325
+ **Papers:**
326
+ - [RAPO (CVPR 2025)](https://arxiv.org/abs/2502.07516): The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
327
+ - [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206): Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
328
+
329
+ **Project Page:** [https://whynothaha.github.io/RAPO_plus_github/](https://whynothaha.github.io/RAPO_plus_github/)
330
+
331
+ **GitHub:** [https://github.com/Vchitect/RAPO](https://github.com/Vchitect/RAPO)
332
+ """)
333
+
334
+ # Event handlers
335
+ process_btn.click(
336
+ fn=process_prompt,
337
+ inputs=[input_prompt, place_num, modifier_num],
338
+ outputs=[output_prompt, retrieved_info]
339
+ )
340
+
341
+ input_prompt.submit(
342
+ fn=process_prompt,
343
+ inputs=[input_prompt, place_num, modifier_num],
344
+ outputs=[output_prompt, retrieved_info]
345
+ )
346
+
347
+ # Launch the app
348
+ if __name__ == "__main__":
349
+ demo.launch()
assets/overview.png ADDED

Git LFS Details

  • SHA256: 5c270f0e4f10c4f1558d45341cb6479b3ca56f2db7aee21cd1584fe167d25ff2
  • Pointer size: 132 Bytes
  • Size of remote file: 1.63 MB
ckpt/temp.py ADDED
@@ -0,0 +1 @@
 
 
1
+
data/graph_test1.csv ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Input,Output,verb_obj_word,scenario_word,place
2
+ two boys standing in a parking lot and talking to each other. One of the boys is wearing a jacket and the other is wearing a vest. They seem to be having a friendly conversation.,"verb_obj_word: ['standing in a parking lot', 'talking to each other'], scenario_word: ['having a friendly conversation'], place: ['standing in a parking lot']","['standing in a parking lot', 'talking to each other']",['having a friendly conversation'],['standing in a parking lot']
3
+ a man and a woman sitting in chairs and talking to each other. The man is wearing a plaid shirt and the woman is wearing glasses. They seem to be discussing something.,"verb_obj_word: ['talking to each other'], scenario_word: ['discussing something'], place: ['sitting in chairs']",['talking to each other'],['discussing something'],['sitting in chairs']
4
+ a woman wearing a hat walking through a dense forest. She is carrying a camera and appears to be taking pictures.,"verb_obj_word: ['walking through a dense forest', 'carrying a camera', 'taking pictures'], scenario_word: [''], place: ['walking through a dense forest']","['walking through a dense forest', 'carrying a camera', 'taking pictures']",[''],['walking through a dense forest']
5
+ a woman working on a pottery wheel in a studio. She appears to be creating a piece of pottery by shaping and molding the clay on the wheel. The woman is focused on her work and seems to be enjoying the process.,"verb_obj_word: ['creating a piece of pottery by shaping and molding the clay on the wheel'], scenario_word: ['enjoying the process'], place: ['working on a pottery wheel in a studio']",['creating a piece of pottery by shaping and molding the clay on the wheel'],['enjoying the process'],['working on a pottery wheel in a studio']
6
+ a close-up view of a statue located on the side of a building. The statue appears to be made of stone and has intricate carvings on it.,"verb_obj_word: ['located on the side of a building'], scenario_word: [''], place: ['a close-up view of a statue located on the side of a building']",['located on the side of a building'],[''],['a close-up view of a statue located on the side of a building']
7
+ "a car driving through a city at night. The car appears to be a sports car, and the driver seems to be enjoying the ride. The city lights can be seen in the background.","verb_obj_word: ['driving through a city at night'], scenario_word: ['enjoying the ride'], place: ['driving a sports car in the city at night']",['driving through a city at night'],['enjoying the ride'],['driving a sports car in the city at night']
8
+ a group of snowboarders performing various tricks and stunts on a snow-covered slope. One of the snowboarders can be seen jumping off a ramp and performing a flip. The video captures the thrill and excitement of snowboarding in the mountains.,"verb_obj_word: ['performing various tricks and stunts', 'jumping off a ramp and performing a flip'], scenario_word: ['the thrill and excitement of snowboarding in the mountains'], place: ['performing tricks and stunts on a snow-covered slope']","['performing various tricks and stunts', 'jumping off a ramp and performing a flip']",['the thrill and excitement of snowboarding in the mountains'],['performing tricks and stunts on a snow-covered slope']
9
+ a green liquid being poured into a beaker. It appears to be a chemical reaction taking place.,"verb_obj_word: ['being poured into a beaker'], scenario_word: ['chemical reaction'], place: ['pouring a green liquid']",['being poured into a beaker'],['chemical reaction'],['pouring a green liquid']
10
+ a man riding a bicycle on a deserted road. He is wearing a yellow shirt and appears to be enjoying the ride. The road is surrounded by trees and there is no traffic in sight.,"verb_obj_word: ['riding a bicycle'], scenario_word: ['enjoying the ride'], place: ['riding a bicycle on a deserted road']",['riding a bicycle'],['enjoying the ride'],['riding a bicycle on a deserted road']
11
+ a woman holding a bunch of balloons while standing in a dark room. The balloons appear to be white in color.,"verb_obj_word: ['holding a bunch of balloons', 'standing in a dark room'], scenario_word: [''], place: ['holding a bunch of balloons in a dark room']","['holding a bunch of balloons', 'standing in a dark room']",[''],['holding a bunch of balloons in a dark room']
12
+ a group of people standing in front of a statue. They are wearing traditional clothing and appear to be posing for a photo. It seems to be a historical or cultural event.,"verb_obj_word: ['posing for a photo'], scenario_word: ['historical or cultural event'], place: ['standing in front of a statue']",['posing for a photo'],['historical or cultural event'],['standing in front of a statue']
13
+ a man wearing a jacket and sitting in front of a screen. He is talking and gesturing with his hands. The background of the video is a purple wall.,"verb_obj_word: ['talking and gesturing with his hands'], scenario_word: [''], place: ['sitting in front of a screen']",['talking and gesturing with his hands'],[''],['sitting in front of a screen']
14
+ a man walking across a bridge in a forest. The man is wearing a blue shirt and appears to be enjoying the scenery around him.,"verb_obj_word: ['walking across a bridge', 'enjoying the scenery'], scenario_word: [''], place: ['walking across a bridge in a forest']","['walking across a bridge', 'enjoying the scenery']",[''],['walking across a bridge in a forest']
15
+ "a car driving on a dirt road in a forest. The car appears to be old and rusty, and it seems to be stuck in the mud. The driver of the car seems to be trying to get it out of the mud.","verb_obj_word: ['driving on a dirt road', 'trying to get it out of the mud'], scenario_word: ['an old and rusty car stuck in the mud'], place: ['a car driving on a dirt road in a forest']","['driving on a dirt road', 'trying to get it out of the mud']",['an old and rusty car stuck in the mud'],['a car driving on a dirt road in a forest']
16
+ a man in a wheelchair who is talking to the camera. He is wearing a black shirt and appears to be in good spirits. There is also a group of people in the background who are dancing.,"verb_obj_word: ['talking to the camera'], scenario_word: ['appears to be in good spirits'], place: ['a man in a wheelchair']",['talking to the camera'],['appears to be in good spirits'],['a man in a wheelchair']
data/graph_test2.csv ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Input,Output,verb_obj_word,scenario_word,place
2
+ a woman wearing a white shirt and sitting in a room. She is looking at the camera and smiling. The room appears to be dimly lit.,"verb_obj_word: ['looking at the camera', 'smiling'], scenario_word: ['appears to be dimly lit'], place: ['sitting in a room']","['looking at the camera', 'smiling']",['appears to be dimly lit'],['sitting in a room']
3
+ "a train traveling along a track in the countryside. The train appears to be moving at a steady pace, and the scenery around it is picturesque. There are no other objects or people visible in the video.","verb_obj_word: ['traveling along a track', 'moving at a steady pace'], scenario_word: ['picturesque scenery'], place: ['traveling in the countryside']","['traveling along a track', 'moving at a steady pace']",['picturesque scenery'],['traveling in the countryside']
4
+ a woman wearing a pink dress standing on a rooftop. She is holding a purse in her hand and appears to be posing for the camera.,"verb_obj_word: ['standing on a rooftop', 'holding a purse in her hand', 'posing for the camera'], scenario_word: [''], place: ['standing on a rooftop']","['standing on a rooftop', 'holding a purse in her hand', 'posing for the camera']",[''],['standing on a rooftop']
5
+ "a woman performing a ballet dance on a stage. She is wearing a pink costume and appears to be practicing her moves. The stage is dimly lit, and there are no other people or objects visible in the background.","verb_obj_word: ['performing a ballet dance', 'wearing a pink costume', 'practicing her moves'], scenario_word: [''], place: ['performing a ballet dance on a stage']","['performing a ballet dance', 'wearing a pink costume', 'practicing her moves']",[''],['performing a ballet dance on a stage']
6
+ "a man standing on a stage, holding a book and speaking to the audience. He is wearing a blue shirt and glasses, and he seems to be giving a lecture or presentation.","verb_obj_word: ['standing on a stage', 'holding a book and speaking to the audience'], scenario_word: ['giving a lecture or presentation'], place: ['standing on a stage']","['standing on a stage', 'holding a book and speaking to the audience']",['giving a lecture or presentation'],['standing on a stage']
7
+ a person wearing a blue jacket walking down a snow-covered road. The person seems to be enjoying the winter weather.,"verb_obj_word: ['walking down a snow-covered road'], scenario_word: ['enjoying the winter weather'], place: ['walking down a snow-covered road']",['walking down a snow-covered road'],['enjoying the winter weather'],['walking down a snow-covered road']
8
+ a group of people riding bicycles on a trail in the mountains. They seem to be enjoying the beautiful scenery and the fresh mountain air.,"verb_obj_word: ['riding bicycles on a trail in the mountains', 'enjoying the beautiful scenery', 'enjoying the fresh mountain air'], scenario_word: [''], place: ['riding bicycles on a trail in the mountains']","['riding bicycles on a trail in the mountains', 'enjoying the beautiful scenery', 'enjoying the fresh mountain air']",[''],['riding bicycles on a trail in the mountains']
9
+ "a group of people sitting at a desk and working on their computers. They appear to be focused on their tasks, and there is a sense of productivity in the air.","verb_obj_word: ['working on their computers'], scenario_word: ['a sense of productivity in the air'], place: ['sitting at a desk']",['working on their computers'],['a sense of productivity in the air'],['sitting at a desk']
10
+ a group of men sitting at a table and enjoying a meal together. They seem to be having a good time as they eat and chat with each other.,"verb_obj_word: ['eating a meal together', 'chatting with each other'], scenario_word: ['having a good time'], place: ['sitting at a table']","['eating a meal together', 'chatting with each other']",['having a good time'],['sitting at a table']
11
+ a young woman wearing a colorful shirt and headphones walking down a street while listening to music on her phone. She appears to be enjoying the music and the surroundings.,"verb_obj_word: ['walking down a street', 'listening to music on her phone'], scenario_word: ['enjoying the music and the surroundings'], place: ['walking down a street']","['walking down a street', 'listening to music on her phone']",['enjoying the music and the surroundings'],['walking down a street']
12
+ a woman singing and playing the guitar. She is wearing a polka dot dress and appears to be enjoying herself while performing.,"verb_obj_word: ['singing and playing the guitar'], scenario_word: ['enjoying herself while performing'], place: ['singing and playing the guitar']",['singing and playing the guitar'],['enjoying herself while performing'],['singing and playing the guitar']
13
+ a man sitting on a bench and reading a newspaper while drinking a cup of coffee. He seems to be enjoying his time and taking a break from his daily routine.,"verb_obj_word: ['reading a newspaper', 'drinking a cup of coffee'], scenario_word: ['enjoying his time', 'taking a break from his daily routine'], place: ['sitting on a bench']","['reading a newspaper', 'drinking a cup of coffee']","['enjoying his time', 'taking a break from his daily routine']",['sitting on a bench']
14
+ a pair of puppets sitting at a desk and talking to each other. The puppets are dressed in suits and appear to be having a conversation.,"verb_obj_word: ['talking to each other'], scenario_word: ['having a conversation'], place: ['a pair of puppets sitting at a desk']",['talking to each other'],['having a conversation'],['a pair of puppets sitting at a desk']
15
+ a man wearing a suit and tie playing the guitar. He appears to be a professional musician and is playing the guitar with great skill.,"verb_obj_word: ['playing the guitar with great skill'], scenario_word: ['appears to be a professional musician'], place: ['a man wearing a suit and tie playing the guitar']",['playing the guitar with great skill'],['appears to be a professional musician'],['a man wearing a suit and tie playing the guitar']
16
+ "a group of people walking down a busy street. They seem to be in a hurry, and there is a lot of traffic on the road. It appears to be a busy day in the city.","verb_obj_word: ['walking down a busy street', 'being in a hurry'], scenario_word: ['a busy day in the city'], place: ['walking down a busy street']","['walking down a busy street', 'being in a hurry']",['a busy day in the city'],['walking down a busy street']
17
+ a young boy standing in a room and talking to the camera. He is wearing a white shirt and appears to be in a playful mood.,"verb_obj_word: ['talking to the camera'], scenario_word: ['appears to be in a playful mood'], place: ['standing in a room']",['talking to the camera'],['appears to be in a playful mood'],['standing in a room']
18
+ a group of people walking down the street with their dogs. They appear to be enjoying a leisurely stroll with their furry companions.,"verb_obj_word: ['walking down the street with their dogs'], scenario_word: ['enjoying a leisurely stroll'], place: ['walking down the street']",['walking down the street with their dogs'],['enjoying a leisurely stroll'],['walking down the street']
data/test_prompts.txt ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ A tranquil tableau of alley
2
+ A tranquil tableau of barn
3
+ a bird and a cat
4
+ a chair and a couch
5
+ a couch and a potted plant
6
+ a potted plant and a tv
7
+ a tv and a laptop
8
+ a laptop and a remote
9
+ a remote and a keyboard
10
+ a keyboard and a cell phone
11
+ a cell phone and a book
12
+ a book and a clock
13
+ A lightning striking atop of eiffel tower, dark clouds in the sky
14
+ a bicycle on the left of a car, front view
15
+ A modern art museum, with colorful paintings
examples/Stage1_RAPO/add_to_graph.py ADDED
@@ -0,0 +1,167 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import json
3
+ import ast
4
+ import torch
5
+ import numpy as np
6
+ import pandas as pd
7
+ import networkx as nx
8
+ from tqdm import tqdm
9
+ from collections import defaultdict
10
+ from sentence_transformers import SentenceTransformer
11
+
12
+ def open_dataset(filename):
13
+ """Load a JSON file and return its content."""
14
+ with open(filename, 'r') as file:
15
+ return json.load(file)
16
+
17
+ def update_graph_from_csv(
18
+ csv_file: str,
19
+ data_prefix_before: str,
20
+ data_prefix_after: str,
21
+ model_path: str = './ckpt/all-MiniLM-L6-v2',
22
+ valid_sentence_log: str = 'valid_sentence.txt'
23
+ ):
24
+ """Update word embeddings, indices, and co-occurrence graphs from new CSV data."""
25
+
26
+ device = "cuda" if torch.cuda.is_available() else "cpu"
27
+ model = SentenceTransformer(model_path)
28
+
29
+ # Load dictionaries
30
+ verb_to_idx = open_dataset(f'{data_prefix_before}/verb_to_idx.json')
31
+ scenario_to_idx = open_dataset(f'{data_prefix_before}/scenario_to_idx.json')
32
+ place_to_idx = open_dataset(f'{data_prefix_before}/place_to_idx.json')
33
+
34
+ # Load sentence index mappings
35
+ verb_in_sentence = open_dataset(f'{data_prefix_before}/verb_in_sentence.json')
36
+ scenario_in_sentence = open_dataset(f'{data_prefix_before}/scenario_in_sentence.json')
37
+ place_in_sentence = open_dataset(f'{data_prefix_before}/place_in_sentence.json')
38
+
39
+ # Load embeddings
40
+ verb_words_embed = open_dataset(f'{data_prefix_before}/verb_words_embed.json')
41
+ scenario_words_embed = open_dataset(f'{data_prefix_before}/scenario_words_embed.json')
42
+ place_embed = open_dataset(f'{data_prefix_before}/place_embed.json')
43
+
44
+ # Load graphs
45
+ G_place_verb = nx.read_graphml(f'{data_prefix_before}/graph_place_verb.graphml')
46
+ G_place_scene = nx.read_graphml(f'{data_prefix_before}/graph_place_scene.graphml')
47
+
48
+ # Load meta information
49
+ data_info = open_dataset(f'{data_prefix_before}/data_info.json')
50
+ valid_sentence = valid_cnt = data_info['valid_sentence']
51
+ v_idx, s_idx, p_idx = data_info['v_idx'], data_info['s_idx'], data_info['p_idx']
52
+
53
+ # Cache to avoid redundant encoding
54
+ verb_cache, scenario_cache, place_cache = {}, {}, {}
55
+
56
+ # Read new CSV data
57
+ df = pd.read_csv(csv_file)
58
+ texts = []
59
+
60
+ for i, row in df.iterrows():
61
+ sentence = row['Input']
62
+ try:
63
+ verb_obj_word = ast.literal_eval(row['verb_obj_word'])
64
+ scenario_word = ast.literal_eval(row['scenario_word'])
65
+ place = ast.literal_eval(row['place'])
66
+ except (ValueError, SyntaxError) as e:
67
+ print(f"Error parsing row {i}: {e}")
68
+ continue
69
+
70
+ # Sanitize empty lists
71
+ verb_obj_word = [] if not verb_obj_word or verb_obj_word[0] == '' else verb_obj_word
72
+ scenario_word = [] if not scenario_word or scenario_word[0] == '' else scenario_word
73
+ place = [] if not place or place[0] == '' else place
74
+
75
+ texts.append([verb_obj_word, scenario_word, place])
76
+
77
+ if len(verb_obj_word) > 0 and len(scenario_word) > 0 and len(place) > 0:
78
+ with open(valid_sentence_log, 'a') as f_valid:
79
+ f_valid.write(f'{sentence}\n')
80
+ valid_sentence += 1
81
+
82
+ print(f"{len(texts)} sentences have been read from the CSV file.")
83
+
84
+ # Process and update graph/embedding/index info
85
+ for i in tqdm(range(len(texts))):
86
+ verbs, scenes, places = texts[i]
87
+ if len(verbs) and len(scenes) and len(places):
88
+ for p in places:
89
+ p = p.strip()
90
+ for s in scenes:
91
+ s = s.strip()
92
+ if s not in scenario_cache:
93
+ s_emb = model.encode(s)
94
+ scenario_cache[s] = s_emb.tolist()
95
+ if s not in scenario_to_idx:
96
+ scenario_to_idx[s] = s_idx
97
+ s_idx += 1
98
+ scenario_words_embed.append(scenario_cache[s])
99
+ scenario_in_sentence.setdefault(s, []).append(valid_cnt)
100
+ G_place_scene.add_edge(p, s)
101
+
102
+ for v in verbs:
103
+ v = v.strip()
104
+ if v not in verb_cache:
105
+ v_emb = model.encode(v)
106
+ verb_cache[v] = v_emb.tolist()
107
+ if v not in verb_to_idx:
108
+ verb_to_idx[v] = v_idx
109
+ v_idx += 1
110
+ verb_words_embed.append(verb_cache[v])
111
+ verb_in_sentence.setdefault(v, []).append(valid_cnt)
112
+ G_place_verb.add_edge(p, v)
113
+
114
+ if p not in place_cache:
115
+ p_emb = model.encode(p)
116
+ place_cache[p] = p_emb.tolist()
117
+ if p not in place_to_idx:
118
+ place_to_idx[p] = p_idx
119
+ p_idx += 1
120
+ place_embed.append(place_cache[p])
121
+ place_in_sentence.setdefault(p, []).append(valid_cnt)
122
+
123
+ valid_cnt += 1
124
+
125
+ print(f"Valid sentences processed: {valid_cnt}")
126
+ print(f"Original valid sentence count: {valid_sentence}")
127
+
128
+ # Update and save metadata
129
+ data_info.update({
130
+ 'valid_sentence': valid_sentence,
131
+ 'p_idx': p_idx,
132
+ 's_idx': s_idx,
133
+ 'v_idx': v_idx
134
+ })
135
+
136
+ os.makedirs(data_prefix_after, exist_ok=True)
137
+
138
+ def save_json(data, name):
139
+ with open(os.path.join(data_prefix_after, f'{name}.json'), 'w') as f:
140
+ json.dump(data, f, indent=4)
141
+ print(f"{name} saved!")
142
+
143
+ # Save all updated data
144
+ save_json(data_info, 'data_info')
145
+ save_json(verb_to_idx, 'verb_to_idx')
146
+ save_json(scenario_to_idx, 'scenario_to_idx')
147
+ save_json(place_to_idx, 'place_to_idx')
148
+ save_json(verb_in_sentence, 'verb_in_sentence')
149
+ save_json(scenario_in_sentence, 'scenario_in_sentence')
150
+ save_json(place_in_sentence, 'place_in_sentence')
151
+ save_json(verb_words_embed, 'verb_words_embed')
152
+ save_json(scenario_words_embed, 'scenario_words_embed')
153
+ save_json(place_embed, 'place_embed')
154
+
155
+ # Save updated graphs
156
+ nx.write_graphml(G_place_verb, os.path.join(data_prefix_after, 'graph_place_verb.graphml'))
157
+ nx.write_graphml(G_place_scene, os.path.join(data_prefix_after, 'graph_place_scene.graphml'))
158
+
159
+ print("Graphs are saved!")
160
+
161
+ # Example usage
162
+ if __name__ == "__main__":
163
+ update_graph_from_csv(
164
+ csv_file="./data/graph_test2.csv",
165
+ data_prefix_before="./graph/graph_test1",
166
+ data_prefix_after="./graph/graph_test2"
167
+ )
examples/Stage1_RAPO/construct_graph.py ADDED
@@ -0,0 +1,151 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import numpy as np
3
+ import pandas as pd
4
+ from sentence_transformers import SentenceTransformer
5
+ import networkx as nx
6
+ from tqdm import tqdm
7
+ import json
8
+ import ast
9
+ from collections import defaultdict
10
+ import os
11
+
12
+ def process_and_save_graph_data(
13
+ csv_file_path: str,
14
+ data_prefix: str,
15
+ model_path: str = './ckpt/all-MiniLM-L6-v2',
16
+ valid_sentence_log: str = 'valid_sentence.txt'
17
+ ):
18
+ device = "cuda" if torch.cuda.is_available() else "cpu"
19
+ model = SentenceTransformer(model_path)
20
+
21
+ # Initialize word-to-index dictionaries
22
+ verb_to_idx, scenario_to_idx, place_to_idx = {}, {}, {}
23
+
24
+ # Track sentence indices containing each word
25
+ verb_in_sentence = defaultdict(list)
26
+ scenario_in_sentence = defaultdict(list)
27
+ place_in_sentence = defaultdict(list)
28
+
29
+ # Store embeddings
30
+ verb_words_embed, scenario_words_embed, place_embed = [], [], []
31
+
32
+ # Cache for already encoded words
33
+ verb_cache, scenario_cache, place_cache = {}, {}, {}
34
+
35
+ # Graphs for co-occurrence relationships
36
+ G_place_scene = nx.Graph()
37
+ G_place_verb = nx.Graph()
38
+
39
+ data_info = {}
40
+ texts = []
41
+ valid_sentence = 0
42
+
43
+ df = pd.read_csv(csv_file_path)
44
+
45
+ # Read and preprocess CSV data
46
+ for i, row in df.iterrows():
47
+ sentence = row['Input']
48
+ try:
49
+ verb_obj_word = ast.literal_eval(row['verb_obj_word'])
50
+ scenario_word = ast.literal_eval(row['scenario_word'])
51
+ place = ast.literal_eval(row['place'])
52
+ except (ValueError, SyntaxError) as e:
53
+ print(f"Error parsing row {i}: {e}")
54
+ continue
55
+
56
+ # Handle empty lists
57
+ verb_obj_word = [] if not verb_obj_word or verb_obj_word[0] == '' else verb_obj_word
58
+ scenario_word = [] if not scenario_word or scenario_word[0] == '' else scenario_word
59
+ place = [] if not place or place[0] == '' else place
60
+
61
+ texts.append([verb_obj_word, scenario_word, place])
62
+
63
+ if len(verb_obj_word) > 0 and len(scenario_word) > 0 and len(place) > 0:
64
+ with open(valid_sentence_log, 'a') as f_valid:
65
+ f_valid.write(f'{sentence}\n')
66
+ valid_sentence += 1
67
+
68
+ print(f"{len(texts)} sentences have been read from the CSV file.")
69
+
70
+ v_idx = s_idx = p_idx = 0
71
+ valid_cnt = 0
72
+
73
+ # Batch process all tokens and encode them if needed
74
+ for i in tqdm(range(len(texts))):
75
+ verbs, scenes, places = texts[i]
76
+ if len(verbs) and len(scenes) and len(places):
77
+ for p in places:
78
+ # Process scene tokens
79
+ for s in scenes:
80
+ if s not in scenario_cache:
81
+ s_emb = model.encode(s)
82
+ scenario_cache[s] = s_emb.tolist()
83
+ if s not in scenario_to_idx:
84
+ scenario_to_idx[s] = s_idx
85
+ s_idx += 1
86
+ scenario_words_embed.append(scenario_cache[s])
87
+ scenario_in_sentence[s].append(valid_cnt)
88
+ G_place_scene.add_edge(p, s)
89
+
90
+ # Process verb tokens
91
+ for v in verbs:
92
+ if v not in verb_cache:
93
+ v_emb = model.encode(v)
94
+ verb_cache[v] = v_emb.tolist()
95
+ if v not in verb_to_idx:
96
+ verb_to_idx[v] = v_idx
97
+ v_idx += 1
98
+ verb_words_embed.append(verb_cache[v])
99
+ verb_in_sentence[v].append(valid_cnt)
100
+ G_place_verb.add_edge(p, v)
101
+
102
+ # Process place tokens
103
+ if p not in place_cache:
104
+ p_emb = model.encode(p)
105
+ place_cache[p] = p_emb.tolist()
106
+ if p not in place_to_idx:
107
+ place_to_idx[p] = p_idx
108
+ p_idx += 1
109
+ place_embed.append(place_cache[p])
110
+ place_in_sentence[p].append(valid_cnt)
111
+
112
+ valid_cnt += 1
113
+
114
+ assert valid_cnt == valid_sentence
115
+ data_info.update({
116
+ 'valid_sentence': valid_sentence,
117
+ 'p_idx': p_idx,
118
+ 's_idx': s_idx,
119
+ 'v_idx': v_idx
120
+ })
121
+
122
+ os.makedirs(data_prefix, exist_ok=True)
123
+
124
+ # Save dictionaries
125
+ def save_json(data, name):
126
+ with open(os.path.join(data_prefix, f'{name}.json'), 'w') as f:
127
+ json.dump(data, f, indent=4)
128
+ print(f"{name} saved!")
129
+
130
+ save_json(data_info, 'data_info')
131
+ save_json(verb_to_idx, 'verb_to_idx')
132
+ save_json(scenario_to_idx, 'scenario_to_idx')
133
+ save_json(place_to_idx, 'place_to_idx')
134
+ save_json(verb_in_sentence, 'verb_in_sentence')
135
+ save_json(scenario_in_sentence, 'scenario_in_sentence')
136
+ save_json(place_in_sentence, 'place_in_sentence')
137
+ save_json(verb_words_embed, 'verb_words_embed')
138
+ save_json(scenario_words_embed, 'scenario_words_embed')
139
+ save_json(place_embed, 'place_embed')
140
+
141
+ # Save graph files
142
+ nx.write_graphml(G_place_verb, os.path.join(data_prefix, 'graph_place_verb.graphml'))
143
+ nx.write_graphml(G_place_scene, os.path.join(data_prefix, 'graph_place_scene.graphml'))
144
+ print("Graphs are saved!")
145
+
146
+ # Example usage
147
+ if __name__ == "__main__":
148
+ process_and_save_graph_data(
149
+ csv_file_path="./data/graph_test1.csv",
150
+ data_prefix="./graph/graph_test1"
151
+ )
examples/Stage1_RAPO/data/graph_test1.csv ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Input,Output,verb_obj_word,scenario_word,place
2
+ two boys standing in a parking lot and talking to each other. One of the boys is wearing a jacket and the other is wearing a vest. They seem to be having a friendly conversation.,"verb_obj_word: ['standing in a parking lot', 'talking to each other'], scenario_word: ['having a friendly conversation'], place: ['standing in a parking lot']","['standing in a parking lot', 'talking to each other']",['having a friendly conversation'],['standing in a parking lot']
3
+ a man and a woman sitting in chairs and talking to each other. The man is wearing a plaid shirt and the woman is wearing glasses. They seem to be discussing something.,"verb_obj_word: ['talking to each other'], scenario_word: ['discussing something'], place: ['sitting in chairs']",['talking to each other'],['discussing something'],['sitting in chairs']
4
+ a woman wearing a hat walking through a dense forest. She is carrying a camera and appears to be taking pictures.,"verb_obj_word: ['walking through a dense forest', 'carrying a camera', 'taking pictures'], scenario_word: [''], place: ['walking through a dense forest']","['walking through a dense forest', 'carrying a camera', 'taking pictures']",[''],['walking through a dense forest']
5
+ a woman working on a pottery wheel in a studio. She appears to be creating a piece of pottery by shaping and molding the clay on the wheel. The woman is focused on her work and seems to be enjoying the process.,"verb_obj_word: ['creating a piece of pottery by shaping and molding the clay on the wheel'], scenario_word: ['enjoying the process'], place: ['working on a pottery wheel in a studio']",['creating a piece of pottery by shaping and molding the clay on the wheel'],['enjoying the process'],['working on a pottery wheel in a studio']
6
+ a close-up view of a statue located on the side of a building. The statue appears to be made of stone and has intricate carvings on it.,"verb_obj_word: ['located on the side of a building'], scenario_word: [''], place: ['a close-up view of a statue located on the side of a building']",['located on the side of a building'],[''],['a close-up view of a statue located on the side of a building']
7
+ "a car driving through a city at night. The car appears to be a sports car, and the driver seems to be enjoying the ride. The city lights can be seen in the background.","verb_obj_word: ['driving through a city at night'], scenario_word: ['enjoying the ride'], place: ['driving a sports car in the city at night']",['driving through a city at night'],['enjoying the ride'],['driving a sports car in the city at night']
8
+ a group of snowboarders performing various tricks and stunts on a snow-covered slope. One of the snowboarders can be seen jumping off a ramp and performing a flip. The video captures the thrill and excitement of snowboarding in the mountains.,"verb_obj_word: ['performing various tricks and stunts', 'jumping off a ramp and performing a flip'], scenario_word: ['the thrill and excitement of snowboarding in the mountains'], place: ['performing tricks and stunts on a snow-covered slope']","['performing various tricks and stunts', 'jumping off a ramp and performing a flip']",['the thrill and excitement of snowboarding in the mountains'],['performing tricks and stunts on a snow-covered slope']
9
+ a green liquid being poured into a beaker. It appears to be a chemical reaction taking place.,"verb_obj_word: ['being poured into a beaker'], scenario_word: ['chemical reaction'], place: ['pouring a green liquid']",['being poured into a beaker'],['chemical reaction'],['pouring a green liquid']
10
+ a man riding a bicycle on a deserted road. He is wearing a yellow shirt and appears to be enjoying the ride. The road is surrounded by trees and there is no traffic in sight.,"verb_obj_word: ['riding a bicycle'], scenario_word: ['enjoying the ride'], place: ['riding a bicycle on a deserted road']",['riding a bicycle'],['enjoying the ride'],['riding a bicycle on a deserted road']
11
+ a woman holding a bunch of balloons while standing in a dark room. The balloons appear to be white in color.,"verb_obj_word: ['holding a bunch of balloons', 'standing in a dark room'], scenario_word: [''], place: ['holding a bunch of balloons in a dark room']","['holding a bunch of balloons', 'standing in a dark room']",[''],['holding a bunch of balloons in a dark room']
12
+ a group of people standing in front of a statue. They are wearing traditional clothing and appear to be posing for a photo. It seems to be a historical or cultural event.,"verb_obj_word: ['posing for a photo'], scenario_word: ['historical or cultural event'], place: ['standing in front of a statue']",['posing for a photo'],['historical or cultural event'],['standing in front of a statue']
13
+ a man wearing a jacket and sitting in front of a screen. He is talking and gesturing with his hands. The background of the video is a purple wall.,"verb_obj_word: ['talking and gesturing with his hands'], scenario_word: [''], place: ['sitting in front of a screen']",['talking and gesturing with his hands'],[''],['sitting in front of a screen']
14
+ a man walking across a bridge in a forest. The man is wearing a blue shirt and appears to be enjoying the scenery around him.,"verb_obj_word: ['walking across a bridge', 'enjoying the scenery'], scenario_word: [''], place: ['walking across a bridge in a forest']","['walking across a bridge', 'enjoying the scenery']",[''],['walking across a bridge in a forest']
15
+ "a car driving on a dirt road in a forest. The car appears to be old and rusty, and it seems to be stuck in the mud. The driver of the car seems to be trying to get it out of the mud.","verb_obj_word: ['driving on a dirt road', 'trying to get it out of the mud'], scenario_word: ['an old and rusty car stuck in the mud'], place: ['a car driving on a dirt road in a forest']","['driving on a dirt road', 'trying to get it out of the mud']",['an old and rusty car stuck in the mud'],['a car driving on a dirt road in a forest']
16
+ a man in a wheelchair who is talking to the camera. He is wearing a black shirt and appears to be in good spirits. There is also a group of people in the background who are dancing.,"verb_obj_word: ['talking to the camera'], scenario_word: ['appears to be in good spirits'], place: ['a man in a wheelchair']",['talking to the camera'],['appears to be in good spirits'],['a man in a wheelchair']
examples/Stage1_RAPO/data/graph_test2.csv ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Input,Output,verb_obj_word,scenario_word,place
2
+ a woman wearing a white shirt and sitting in a room. She is looking at the camera and smiling. The room appears to be dimly lit.,"verb_obj_word: ['looking at the camera', 'smiling'], scenario_word: ['appears to be dimly lit'], place: ['sitting in a room']","['looking at the camera', 'smiling']",['appears to be dimly lit'],['sitting in a room']
3
+ "a train traveling along a track in the countryside. The train appears to be moving at a steady pace, and the scenery around it is picturesque. There are no other objects or people visible in the video.","verb_obj_word: ['traveling along a track', 'moving at a steady pace'], scenario_word: ['picturesque scenery'], place: ['traveling in the countryside']","['traveling along a track', 'moving at a steady pace']",['picturesque scenery'],['traveling in the countryside']
4
+ a woman wearing a pink dress standing on a rooftop. She is holding a purse in her hand and appears to be posing for the camera.,"verb_obj_word: ['standing on a rooftop', 'holding a purse in her hand', 'posing for the camera'], scenario_word: [''], place: ['standing on a rooftop']","['standing on a rooftop', 'holding a purse in her hand', 'posing for the camera']",[''],['standing on a rooftop']
5
+ "a woman performing a ballet dance on a stage. She is wearing a pink costume and appears to be practicing her moves. The stage is dimly lit, and there are no other people or objects visible in the background.","verb_obj_word: ['performing a ballet dance', 'wearing a pink costume', 'practicing her moves'], scenario_word: [''], place: ['performing a ballet dance on a stage']","['performing a ballet dance', 'wearing a pink costume', 'practicing her moves']",[''],['performing a ballet dance on a stage']
6
+ "a man standing on a stage, holding a book and speaking to the audience. He is wearing a blue shirt and glasses, and he seems to be giving a lecture or presentation.","verb_obj_word: ['standing on a stage', 'holding a book and speaking to the audience'], scenario_word: ['giving a lecture or presentation'], place: ['standing on a stage']","['standing on a stage', 'holding a book and speaking to the audience']",['giving a lecture or presentation'],['standing on a stage']
7
+ a person wearing a blue jacket walking down a snow-covered road. The person seems to be enjoying the winter weather.,"verb_obj_word: ['walking down a snow-covered road'], scenario_word: ['enjoying the winter weather'], place: ['walking down a snow-covered road']",['walking down a snow-covered road'],['enjoying the winter weather'],['walking down a snow-covered road']
8
+ a group of people riding bicycles on a trail in the mountains. They seem to be enjoying the beautiful scenery and the fresh mountain air.,"verb_obj_word: ['riding bicycles on a trail in the mountains', 'enjoying the beautiful scenery', 'enjoying the fresh mountain air'], scenario_word: [''], place: ['riding bicycles on a trail in the mountains']","['riding bicycles on a trail in the mountains', 'enjoying the beautiful scenery', 'enjoying the fresh mountain air']",[''],['riding bicycles on a trail in the mountains']
9
+ "a group of people sitting at a desk and working on their computers. They appear to be focused on their tasks, and there is a sense of productivity in the air.","verb_obj_word: ['working on their computers'], scenario_word: ['a sense of productivity in the air'], place: ['sitting at a desk']",['working on their computers'],['a sense of productivity in the air'],['sitting at a desk']
10
+ a group of men sitting at a table and enjoying a meal together. They seem to be having a good time as they eat and chat with each other.,"verb_obj_word: ['eating a meal together', 'chatting with each other'], scenario_word: ['having a good time'], place: ['sitting at a table']","['eating a meal together', 'chatting with each other']",['having a good time'],['sitting at a table']
11
+ a young woman wearing a colorful shirt and headphones walking down a street while listening to music on her phone. She appears to be enjoying the music and the surroundings.,"verb_obj_word: ['walking down a street', 'listening to music on her phone'], scenario_word: ['enjoying the music and the surroundings'], place: ['walking down a street']","['walking down a street', 'listening to music on her phone']",['enjoying the music and the surroundings'],['walking down a street']
12
+ a woman singing and playing the guitar. She is wearing a polka dot dress and appears to be enjoying herself while performing.,"verb_obj_word: ['singing and playing the guitar'], scenario_word: ['enjoying herself while performing'], place: ['singing and playing the guitar']",['singing and playing the guitar'],['enjoying herself while performing'],['singing and playing the guitar']
13
+ a man sitting on a bench and reading a newspaper while drinking a cup of coffee. He seems to be enjoying his time and taking a break from his daily routine.,"verb_obj_word: ['reading a newspaper', 'drinking a cup of coffee'], scenario_word: ['enjoying his time', 'taking a break from his daily routine'], place: ['sitting on a bench']","['reading a newspaper', 'drinking a cup of coffee']","['enjoying his time', 'taking a break from his daily routine']",['sitting on a bench']
14
+ a pair of puppets sitting at a desk and talking to each other. The puppets are dressed in suits and appear to be having a conversation.,"verb_obj_word: ['talking to each other'], scenario_word: ['having a conversation'], place: ['a pair of puppets sitting at a desk']",['talking to each other'],['having a conversation'],['a pair of puppets sitting at a desk']
15
+ a man wearing a suit and tie playing the guitar. He appears to be a professional musician and is playing the guitar with great skill.,"verb_obj_word: ['playing the guitar with great skill'], scenario_word: ['appears to be a professional musician'], place: ['a man wearing a suit and tie playing the guitar']",['playing the guitar with great skill'],['appears to be a professional musician'],['a man wearing a suit and tie playing the guitar']
16
+ "a group of people walking down a busy street. They seem to be in a hurry, and there is a lot of traffic on the road. It appears to be a busy day in the city.","verb_obj_word: ['walking down a busy street', 'being in a hurry'], scenario_word: ['a busy day in the city'], place: ['walking down a busy street']","['walking down a busy street', 'being in a hurry']",['a busy day in the city'],['walking down a busy street']
17
+ a young boy standing in a room and talking to the camera. He is wearing a white shirt and appears to be in a playful mood.,"verb_obj_word: ['talking to the camera'], scenario_word: ['appears to be in a playful mood'], place: ['standing in a room']",['talking to the camera'],['appears to be in a playful mood'],['standing in a room']
18
+ a group of people walking down the street with their dogs. They appear to be enjoying a leisurely stroll with their furry companions.,"verb_obj_word: ['walking down the street with their dogs'], scenario_word: ['enjoying a leisurely stroll'], place: ['walking down the street']",['walking down the street with their dogs'],['enjoying a leisurely stroll'],['walking down the street']
examples/Stage1_RAPO/data/test_prompts.txt ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ A tranquil tableau of alley
2
+ A tranquil tableau of barn
3
+ a bird and a cat
4
+ a chair and a couch
5
+ a couch and a potted plant
6
+ a potted plant and a tv
7
+ a tv and a laptop
8
+ a laptop and a remote
9
+ a remote and a keyboard
10
+ a keyboard and a cell phone
11
+ a cell phone and a book
12
+ a book and a clock
13
+ A lightning striking atop of eiffel tower, dark clouds in the sky
14
+ a bicycle on the left of a car, front view
15
+ A modern art museum, with colorful paintings
examples/Stage1_RAPO/refactoring.py ADDED
@@ -0,0 +1,96 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import torch
3
+ from transformers import AutoModelForCausalLM, AutoTokenizer
4
+ from tqdm import tqdm
5
+ import argparse
6
+
7
+ def get_output(prompt):
8
+ template = (
9
+ 'Refine the sentence: "{}" to contain subject description, action, scene description. '
10
+ 'Transform user-entered text into a concise, detailed description with a specific structure. '
11
+ '(Optional: camera language, light and shadow, atmosphere) and conceive some additional actions to make the sentence more dynamic. '
12
+ 'Make sure it is a fluent sentence, not nonsense.'
13
+ )
14
+ prompt_text = template.format(prompt)
15
+ messages = [
16
+ {"role": "system", "content": "You are a caption refiner."},
17
+ {"role": "user", "content": prompt_text}
18
+ ]
19
+
20
+ # prepare inputs
21
+ input_ids = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
22
+ model_inputs = tokenizer([input_ids], return_tensors="pt").to(device)
23
+
24
+ # generate
25
+ generated_ids = model.generate(
26
+ model_inputs.input_ids,
27
+ max_new_tokens=512
28
+ )
29
+ # strip prompt prefix
30
+ generated_ids = [
31
+ output_ids[len(input_ids):]
32
+ for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
33
+ ]
34
+ responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
35
+ return responses[0]
36
+
37
+
38
+ def get_start_index(txt_path):
39
+ """
40
+ Read existing output file to determine resume index (number of lines).
41
+ """
42
+ if os.path.exists(txt_path):
43
+ with open(txt_path, 'r', encoding='utf-8') as f:
44
+ return len(f.readlines())
45
+ return 0
46
+
47
+
48
+ def main():
49
+ # determine from which line to resume
50
+ start_idx = get_start_index(output_path)
51
+
52
+ # read all prompts
53
+ with open(input_path, 'r', encoding='utf-8') as f:
54
+ prompts = [line.strip() for line in f if line.strip()]
55
+
56
+ # open output file for append
57
+ with open(output_path, 'a', encoding='utf-8') as outf:
58
+ for i in tqdm(range(start_idx, len(prompts)), desc="Refining prompts"):
59
+ prompt = prompts[i]
60
+ try:
61
+ refined = get_output(prompt)
62
+ except Exception as e:
63
+ refined = f"[ERROR] {e}"
64
+ outf.write(refined + '\n')
65
+
66
+ if __name__ == '__main__':
67
+ parser = argparse.ArgumentParser(description='Refine captions and output to text file')
68
+ parser.add_argument(
69
+ '--mode_path', type=str,
70
+ default='llama3_8B_lora_merged_cn',
71
+ help='Model path or identifier'
72
+ )
73
+ parser.add_argument(
74
+ '--input_word_augmentation', type=str,
75
+ default='./output/refactor/merging_reuslts.txt',
76
+ help='Path to input text prompts'
77
+ )
78
+ parser.add_argument(
79
+ '--output_refactoring', type=str,
80
+ default='./output/refactor/refactoring_results.txt',
81
+ help='Path to output text file'
82
+ )
83
+ args = parser.parse_args()
84
+
85
+ # setup
86
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
87
+ tokenizer = AutoTokenizer.from_pretrained(args.mode_path, trust_remote_code=True)
88
+ model = AutoModelForCausalLM.from_pretrained(
89
+ args.mode_path,
90
+ trust_remote_code=True
91
+ ).to(device).eval()
92
+
93
+ input_path = args.input_word_augmentation
94
+ output_path = args.output_refactoring
95
+
96
+ main()
examples/Stage1_RAPO/refactoring.sh ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ python refactoring.py \
2
+ --mode_path "../../ckpt/llama3_1_instruct_lora_rewrite" \
3
+ --input_word_augmentation "./output/refactor/merging_reuslts.txt" \
4
+ --output_refactoring "./output/refactor/refactoring_reuslts.txt" \
examples/Stage1_RAPO/retrieve_modifiers.py ADDED
@@ -0,0 +1,116 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import random
2
+ import torch
3
+ import json
4
+ import networkx as nx
5
+ from sentence_transformers import SentenceTransformer
6
+ from torch.nn.functional import cosine_similarity
7
+ from tqdm import tqdm
8
+ import csv
9
+ import os
10
+ import argparse
11
+
12
+ model = SentenceTransformer('./ckpt/all-MiniLM-L6-v2')
13
+
14
+ def open_dataset(filename):
15
+ with open(filename, 'r') as file:
16
+ data = json.load(file)
17
+ return data
18
+
19
+ if __name__ == "__main__":
20
+ parser = argparse.ArgumentParser(description='Retrieve graph construction.')
21
+ parser.add_argument('--graph_data_dir', type=str, required=True)
22
+ parser.add_argument('--output_filename', type=str, required=True)
23
+
24
+ # Get command line arguments
25
+ args = parser.parse_args()
26
+ place_num = 3
27
+ verb_num = 5
28
+ topk_num = verb_num
29
+ Retrieve_num = 30
30
+
31
+ # Setting variables via command line arguments
32
+ graph_data_dir = args.graph_data_dir
33
+ output_filename = args.output_filename
34
+
35
+ test_file = f'./data/test_prompts.txt'
36
+ output_txt = f'./output/retrieve_words/{output_filename}.txt'
37
+ output_csv = f'./output/retrieve_words/{output_filename}.csv'
38
+
39
+ # load verb_to_idx, scenario_to_idx, place_to_idx,verb_in_sentence, scenario_in_sentence, place_in_sentence,verb_words_embed, scenario_words_embed, place_embed
40
+ verb_to_idx = open_dataset(f'{graph_data_dir}/verb_to_idx.json')
41
+ scenario_to_idx = open_dataset(f'{graph_data_dir}/scenario_to_idx.json')
42
+ place_to_idx = open_dataset(f'{graph_data_dir}/place_to_idx.json')
43
+ idx_to_place = {v: k for k, v in place_to_idx.items()}
44
+ verb_in_sentence = open_dataset(f'{graph_data_dir}/verb_in_sentence.json')
45
+ scenario_in_sentence = open_dataset(f'{graph_data_dir}/scenario_in_sentence.json')
46
+ place_in_sentence = open_dataset(f'{graph_data_dir}/place_in_sentence.json')
47
+ verb_words_embed = open_dataset(f'{graph_data_dir}/verb_words_embed.json')
48
+ scenario_words_embed = open_dataset(f'{graph_data_dir}/scenario_words_embed.json')
49
+ place_embed = open_dataset(f'{graph_data_dir}/place_embed.json')
50
+
51
+ # Loading graph structure
52
+ G_place_verb = nx.read_graphml(f'{graph_data_dir}/graph_place_verb.graphml')
53
+ G_place_scene = nx.read_graphml(f'{graph_data_dir}/graph_place_scene.graphml')
54
+
55
+ output_folder = os.path.dirname(output_txt)
56
+ os.makedirs(output_folder, exist_ok=True)
57
+ output_csv_folder = os.path.dirname(output_csv)
58
+ os.makedirs(output_csv_folder, exist_ok=True)
59
+
60
+ verb_obj_word, scenario_word, place = "", "", ""
61
+ with open(test_file, 'r') as f:
62
+ total_line = sum(1 for _ in f)
63
+ f.seek(0)
64
+ for i, line in enumerate(tqdm(f.readlines(), total=total_line)):
65
+ sentence = line.replace('\n', "")
66
+ potential_action, potential_sub_atmos, potential_scene = [], [], []
67
+ sentence_emb = model.encode(sentence)
68
+ sim = cosine_similarity(torch.tensor(sentence_emb).unsqueeze(0), torch.tensor(place_embed))
69
+ top1_idx = torch.topk(sim, place_num).indices
70
+ for idx in top1_idx.numpy().tolist():
71
+ verb_neighbors = list(G_place_verb.neighbors(idx_to_place[idx]))
72
+ scene_neighbors = list(G_place_scene.neighbors(idx_to_place[idx]))
73
+ verb_random = random.sample(verb_neighbors, verb_num) if len(verb_neighbors) >= verb_num else verb_neighbors
74
+ scene_random = random.sample(scene_neighbors, verb_num) if len(scene_neighbors) >= verb_num else scene_neighbors
75
+
76
+ v_random_embed = []
77
+ for v_random in verb_random:
78
+ if v_random in verb_to_idx:
79
+ v_random_embed.append(verb_words_embed[verb_to_idx[v_random]])
80
+ if len(v_random_embed) > 0:
81
+ v_sim = cosine_similarity(torch.tensor(v_random_embed), torch.tensor(sentence_emb).unsqueeze(0))
82
+ v_random_candidate = torch.topk(v_sim, topk_num if len(v_sim) >= topk_num else len(v_sim)).indices.numpy().tolist()
83
+ potential_action += [verb_random[i] for i in v_random_candidate]
84
+
85
+ s_random_embed, p_random_embed = [], []
86
+ place_cross_word = []
87
+ for s_random in scene_random:
88
+ if s_random in scenario_to_idx:
89
+ s_random_embed.append(scenario_words_embed[scenario_to_idx[s_random]])
90
+ else:
91
+ p_random_embed.append(place_embed[place_to_idx[s_random]])
92
+ place_cross_word.append(s_random)
93
+ if len(s_random_embed) > 0:
94
+ s_sim = cosine_similarity(torch.tensor(s_random_embed), torch.tensor(sentence_emb).unsqueeze(0))
95
+ s_random_candidate = torch.topk(s_sim, topk_num if len(s_sim) >= topk_num else len(s_sim)).indices.numpy().tolist()
96
+
97
+ if len(place_cross_word) > 0:
98
+ p_sim = cosine_similarity(torch.tensor(p_random_embed), torch.tensor(sentence_emb).unsqueeze(0))
99
+ p_random_candidate = torch.topk(p_sim, 1 if len(p_sim) >= 1 else len(p_sim)).indices.numpy().tolist()
100
+ potential_scene += [place_cross_word[k] for k in p_random_candidate]
101
+
102
+ potential_sub_atmos += [scene_random[j] for j in s_random_candidate]
103
+ potential_scene.append(idx_to_place[idx])
104
+
105
+ word_set = set(potential_action + potential_sub_atmos + potential_scene)
106
+
107
+ with open(output_txt, 'a') as f_txt:
108
+ f_txt.write(f'{sentence}. {", ".join(word_set)}\n')
109
+
110
+ with open(output_csv, 'a', encoding='utf-8', newline="") as fc:
111
+ writer = csv.writer(fc)
112
+ if i < Retrieve_num:
113
+ writer.writerow(['sentence', 'potential_action', 'potential_sub_atmos', 'potential_scene'])
114
+ writer.writerow([sentence, set(potential_action), set(potential_sub_atmos), set(potential_scene)])
115
+
116
+ print("Retrieve process is finished!")
examples/Stage1_RAPO/retrieve_modifiers.sh ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ python retrieve_modifiers.py \
2
+ --graph_data_dir "relation_graph/graph_data" \
3
+ --output_filename "retrieved_words" \
examples/Stage1_RAPO/rewrite_via_instruction.py ADDED
@@ -0,0 +1,60 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pandas as pd
2
+ from transformers import AutoModelForCausalLM, AutoTokenizer
3
+ import torch
4
+ import re
5
+ import argparse
6
+ import os
7
+
8
+ def extract_output(model_output):
9
+ match = re.search(r'Final Output:\s*(.*)', model_output, re.IGNORECASE)
10
+ if match:
11
+ return match.group(1).strip()
12
+ return model_output.strip()
13
+
14
+ if __name__ == "__main__":
15
+ parser = argparse.ArgumentParser(description='Process text and generate output.')
16
+ parser.add_argument('--input_path', type=str, required=True)
17
+ parser.add_argument('--output_path', type=str, required=True)
18
+ args = parser.parse_args()
19
+
20
+ input_file_path = args.input_path
21
+ output_path = args.output_path
22
+
23
+ model_id = './ckpt/Mistral-7B-Instruct-v0.3/'
24
+ tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
25
+ tokenizer.pad_token = tokenizer.eos_token
26
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
27
+
28
+ device = "cuda" if torch.cuda.is_available() else "cpu"
29
+ model.to(device)
30
+
31
+ output_data = []
32
+ with open(input_file_path, "r") as infile:
33
+ for line in infile:
34
+ The_current_input = line.strip()
35
+ prompts = [f"""Please limit your output to 30 words or less. Suppose you are a Text Aligner, and your role is to transform user-entered text into a concise, detailed description with a specific structure. You should ensure that the output text is coherent, contextually relevant, and follows the same structure as the examples provided. The output should frequently use common phrases like 'She is,' 'appears to,' 'There are,' 'seems to,' 'It appears to,' and 'They are.' The sentence structure should maintain clarity and be specific about locations and actions. The output should incorporate high-frequency words such as: appears, wearing, woman, enjoying, man, view, sitting, group, seems, seem, people, young, standing, time, beautiful, white, closeup, holding, aerial, shirt, video, appear, surrounded, playing, together, peaceful, front, background, focused, using, working, table, good, black, person, serene, others, sky, walking, trees, around, room, city, water, visible, green, blue, captures, camera, something.
36
+ Examples provided:
37
+ (1) input: A child plays with toys. output: a young child playing with toys in a room. The child is sitting on the floor surrounded by various toys and appears to be having fun.
38
+ (2) input: A bear explores its surroundings. output: a black bear walking around in a grassy area. The bear appears to be exploring its surroundings and seems to be curious about its environment.
39
+ (3) input: A woman ties a string around an orange. output: a woman sitting at a table and tying a string around an orange. She is wearing a brown robe and appears to be preparing a gift.
40
+ (4) input: A doctor performs a procedure on a patient. output: a man wearing a surgical gown and mask standing next to a patient in a hospital room. The man is a doctor who is performing a procedure on the patient.
41
+ (5) input: A monkey looks around. output: a monkey sitting on a tree branch. The monkey appears to be looking around and seems to be curious about its surroundings.
42
+ (6) input: People discuss something at a table. output: a group of people gathered around a table, discussing something together. It appears to be a business meeting or a brainstorming session. The people in the video are engaged in a conversation and seem to be focused on the topic at hand.
43
+ (7) input: A girl holds a cardboard star. output: a young girl wearing a blue dress and holding a cardboard star. She is standing in front of a white background and appears to be smiling.
44
+ (8) input: Young people swim in a pool. output: a group of young people having fun in a swimming pool. They are all wearing swimsuits and enjoying themselves. One of the people in the pool is wearing a bikini.
45
+ (9) input: A woman writes with a pen. output: a close-up shot of a woman's hand holding a pen and writing on a piece of paper. The woman is wearing a ring on her finger and appears to be focused on her work.
46
+ The current input: {The_current_input} , Final Output:"""
47
+ ]
48
+
49
+ inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device).input_ids
50
+ attention_mask = torch.ones(inputs.shape, dtype=torch.long, device=device)
51
+ outputs = model.generate(inputs, max_new_tokens=500, pad_token_id=tokenizer.eos_token_id, attention_mask=attention_mask)
52
+ output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
53
+ for output in output_text:
54
+ extracted_result = extract_output(output)
55
+ output_data.append({"output": extracted_result})
56
+ with open(output_path, 'a', encoding='utf-8') as txt_file:
57
+ txt_file.write(extracted_result + '\n')
58
+ print(f"extracted_result:{extracted_result}")
59
+
60
+ print(f"Rewrite outputs saved to {output_path}")
examples/Stage1_RAPO/rewrite_via_instruction.sh ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ srun -p video-aigc-4 -n1 --gres=gpu:1 --cpus-per-task=8 --quotatype=spot --async \
2
+ -N1 --job-name=python \
3
+ python rewrite_via_instruction.py \
4
+ --input_path "./data/test_prompts.txt" \
5
+ --output_path "./output/rewrite_via_instruction/test_prompts.txt" \
examples/Stage1_RAPO/word_augment.py ADDED
@@ -0,0 +1,254 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import random
2
+ import torch
3
+ import json
4
+ import networkx as nx
5
+ from sentence_transformers import SentenceTransformer
6
+ from torch.nn.functional import cosine_similarity
7
+ from transformers import AutoModelForCausalLM, AutoTokenizer
8
+ from tqdm import tqdm
9
+ import csv
10
+ import os
11
+ import argparse
12
+ import pandas as pd
13
+ import string
14
+ import re
15
+ import numpy as np
16
+
17
+
18
+ def open_dataset(filename):
19
+ """Load a JSON file"""
20
+ with open(filename, 'r') as file:
21
+ data = json.load(file)
22
+ return data
23
+
24
+ # Remove punctuation
25
+ def remove_punctuation(text):
26
+ return text.translate(str.maketrans('', '', string.punctuation))
27
+
28
+ # Compute similarity between two texts
29
+ def compute_similarity(text1, text2, model, embeddings_cache):
30
+ if text1 not in embeddings_cache:
31
+ embeddings_cache[text1] = model.encode(text1)
32
+ if text2 not in embeddings_cache:
33
+ embeddings_cache[text2] = model.encode(text2)
34
+ embedding1 = torch.tensor(embeddings_cache[text1]).unsqueeze(0)
35
+ embedding2 = torch.tensor(embeddings_cache[text2]).unsqueeze(0)
36
+ similarity = cosine_similarity(embedding1, embedding2).item()
37
+ return similarity
38
+
39
+
40
+ # Extract similarity score from string
41
+ def extract_sim_score(text):
42
+ match = re.search(r'sim_score=([0-9.]+)', text)
43
+ if match:
44
+ return float(match.group(1))
45
+ return 0.0
46
+
47
+ # Extract the final output from model output
48
+ def extract_output(model_output):
49
+ # Assume output contains 'Final Output: ' followed by the desired result
50
+ match = re.search(r'Final Output:\s*(.*)', model_output, re.IGNORECASE)
51
+ if match:
52
+ return match.group(1).strip()
53
+ return model_output.strip()
54
+
55
+
56
+ # Get max number of columns in a CSV file
57
+ def get_max_columns(input_csv_path):
58
+ max_columns = 0
59
+ with open(input_csv_path, 'r', encoding='utf-8') as f:
60
+ reader = csv.reader(f)
61
+ for row in reader:
62
+ max_columns = max(max_columns, len(row))
63
+ return max_columns
64
+
65
+ ### Similarity Ranking
66
+ def Similarity_Ranking(input_txt, simrank_path, SentenceTransformer_model):
67
+ """
68
+ input_txt: each line formatted as 'prefix.suffix1,suffix2,...'
69
+ simrank_path: path to output CSV
70
+ SentenceTransformer_model: model for computing similarity
71
+ """
72
+ embeddings_cache = {}
73
+
74
+ with open(input_txt, 'r', encoding='utf-8') as f, \
75
+ open(simrank_path, 'w', encoding='utf-8', newline='') as csv_file:
76
+
77
+ writer = csv.writer(csv_file, quoting=csv.QUOTE_MINIMAL)
78
+
79
+ for line in f:
80
+ line = line.strip()
81
+ if not line:
82
+ continue
83
+
84
+ # 1. Split at the first '.'
85
+ idx = line.find('.')
86
+ if idx == -1:
87
+ # Skip if no '.' is found
88
+ continue
89
+
90
+ first_part = line[:idx+1].strip()
91
+ rest = line[idx+1:].strip()
92
+
93
+ # 2. Split suffixes by comma and remove whitespace
94
+ other_parts = [part.strip() for part in rest.split(',') if part.strip()]
95
+
96
+ # 3. Combine into parts list
97
+ parts = [first_part] + other_parts
98
+
99
+ # 4. Remove punctuation from parts
100
+ clean_parts_no_punct = [parts[0]] + [
101
+ remove_punctuation(part) for part in parts[1:]
102
+ ]
103
+
104
+ # 5. Compute similarity scores
105
+ processed_with_similarity = []
106
+ before_period = clean_parts_no_punct[0]
107
+ for part in clean_parts_no_punct[1:]:
108
+ sim_score = compute_similarity(
109
+ before_period, part, SentenceTransformer_model, embeddings_cache
110
+ )
111
+ sim_score = round(sim_score, 4)
112
+ processed_with_similarity.append((part, sim_score))
113
+ print(f"processed_part: {part}, sim_score={sim_score}")
114
+
115
+ # 6. Sort by similarity descending
116
+ processed_with_similarity_sorted = sorted(
117
+ processed_with_similarity,
118
+ key=lambda x: x[1],
119
+ reverse=True
120
+ )
121
+
122
+ # 7. Format output fields
123
+ formatted_processed = [first_part]
124
+ for part, sim in processed_with_similarity_sorted:
125
+ formatted_processed.append(f"{part}, sim_score={sim}")
126
+
127
+ # 8. Write a row to the output CSV
128
+ csv_line = [first_part, rest] + formatted_processed
129
+ writer.writerow(csv_line)
130
+
131
+ print(f"Similarity Ranking completed, results saved to {simrank_path}")
132
+ return simrank_path
133
+ ### Similarity Ranking
134
+
135
+ ### Iteractively Merging
136
+ def Iteractively_Merging(simrank_path, merging_path, selected_modifiers, SIMILARITY_THRESHOLD):
137
+ max_columns = get_max_columns(simrank_path)
138
+ print(f"Max columns: {max_columns}")
139
+ simrank_file = pd.read_csv(simrank_path, header=None, names=[f"col{i}" for i in range(max_columns)], encoding='utf-8')
140
+ output_data = []
141
+ for index, row in simrank_file.iterrows():
142
+ try:
143
+ original_text = row.iloc[1]
144
+ modifiers = row.iloc[2:]
145
+ modifiers_with_scores = []
146
+ for modifier in modifiers:
147
+ if pd.isna(modifier):
148
+ continue
149
+ parts = modifier.split(", sim_score=")
150
+ if len(parts) == 2:
151
+ mod_text = parts[0].strip()
152
+ sim_score = extract_sim_score(modifier)
153
+ if sim_score >= SIMILARITY_THRESHOLD:
154
+ modifiers_with_scores.append((mod_text, sim_score))
155
+ modifiers_with_scores_sorted = sorted(modifiers_with_scores, key=lambda x: x[1], reverse=True)
156
+
157
+ current_description = original_text
158
+ processed_outputs = []
159
+ for modifier, sim_score in modifiers_with_scores_sorted:
160
+ if sim_score < SIMILARITY_THRESHOLD:
161
+ print(f"Row {index+1}: sim_score={sim_score} below threshold {SIMILARITY_THRESHOLD}, stopping inference.")
162
+ break
163
+ try:
164
+ prompt = f"""Suppose you are a Text Rewriter, and your role is to transform user-entered text into a concise, detailed description. You receive two inputs from the user: description body and relevant modifiers. Your task is to enrich the description body with relevant modifiers while retaining the description body. You should ensure that the output text is coherent, contextually relevant, and follows the same structure as the examples provided.
165
+ Examples provided:
166
+ (1) Description body: a group of dancers performing a ballet routine in a studio. The dancers are wearing ballet shoes.
167
+ Relevant modifiers: dressed in black leotards.
168
+ Output: a group of dancers performing a ballet routine in a studio. The dancers are wearing ballet shoes and are dressed in black leotards.
169
+ (2) Description body: a woman sitting at a desk and working on her laptop.
170
+ Relevant modifiers: appears to be focused on her work.
171
+ Output: a woman sitting at a desk and working on her laptop, appears to be focused on her work.
172
+ (3) Description body: they seem to be having a good time and enjoying each other's company.
173
+ Relevant modifiers: a casual and relaxed setting.
174
+ Output: They seem to be having a good time and enjoying each other's company. It appears to be a casual and relaxed setting.
175
+ (4) Description body: a woman preparing a delicious meal in her kitchen.
176
+ Relevant modifiers: cutting various fruits and vegetables on a cutting board.
177
+ Output: a woman preparing a delicious meal in her kitchen. She is seen cutting various fruits and vegetables on a cutting board and placing them on a tray.
178
+ The Description body: {current_description}, Relevant modifiers: {modifier}, Final Output:"""
179
+ inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
180
+ attention_mask = torch.ones(inputs.shape, dtype=torch.long, device=device)
181
+ outputs = merging_model.generate(
182
+ inputs,
183
+ max_new_tokens=500,
184
+ pad_token_id=tokenizer.eos_token_id,
185
+ attention_mask=attention_mask
186
+ )
187
+ output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
188
+ extracted_output = extract_output(output_text)
189
+ processed_output = f"{modifier}, sim_score={sim_score}"
190
+ processed_outputs.append(processed_output)
191
+ current_description = extracted_output
192
+ print(f"Row {index+1}, Step with modifier '{modifier}': {processed_output}")
193
+ except Exception as e:
194
+ print(f"Error during inference on row {index+1}: {e}")
195
+ continue
196
+
197
+ # Save final description
198
+ output_row = {
199
+ "original_text": original_text,
200
+ "before_period": row.iloc[0] if len(row) > 0 else "",
201
+ "final_description": current_description,
202
+ "processed_outputs": "; ".join(processed_outputs)
203
+ }
204
+ output_data.append(output_row)
205
+ with open(merging_path, 'a', encoding='utf-8') as desc_file:
206
+ desc_file.write(current_description + '\n')
207
+
208
+ except Exception as e:
209
+ print(f"Error processing row {index+1}: {e}")
210
+ continue
211
+ output_df = pd.DataFrame(output_data)
212
+ output_df.to_csv(selected_modifiers, index=False, encoding='utf-8')
213
+ return merging_path, selected_modifiers, output_data
214
+ ### Iteractively Merging
215
+
216
+
217
+
218
+ if __name__ == "__main__":
219
+ parser = argparse.ArgumentParser(description='Retrieve graph construction.')
220
+ parser.add_argument('--retrieved_words', type=str, default="./output/all_dimension.csv", help='path of retrieved modifiers')
221
+ parser.add_argument('--pretrained_SentenceTransformer', type=str, default='./ckpt/all-MiniLM-L6-v2', help='SentenceTransformer model path')
222
+ parser.add_argument('--pretrained_merging', type=str, default='./ckpt/Mistral-7B-Instruct-v0.3/', help='merging model path')
223
+ parser.add_argument('--input_path', type=str, required=True, help='input text file path')
224
+ parser.add_argument('--output_simrank', type=str, required=True, help='output ranking CSV')
225
+ parser.add_argument('--output_selected_modifiers', type=str, required=True, help='output selected modifiers CSV')
226
+ parser.add_argument('--output_interactive_merging', type=str, required=True, help='results after interactive merging')
227
+ parser.add_argument('--SIMILARITY_THRESHOLD', type=float, required=True, help='similarity threshold')
228
+
229
+ args = parser.parse_args()
230
+ SentenceTransformer_model = SentenceTransformer(args.pretrained_SentenceTransformer)
231
+ merging_model_path = args.pretrained_merging
232
+ input_path = args.input_path
233
+ retrieved_words = args.retrieved_words
234
+ simrank_path = args.output_simrank
235
+ selected_modifiers_path = args.output_selected_modifiers
236
+ merging_path = args.output_interactive_merging
237
+ SIMILARITY_THRESHOLD = args.SIMILARITY_THRESHOLD
238
+
239
+ output_folder = os.path.dirname(simrank_path)
240
+ os.makedirs(output_folder, exist_ok=True)
241
+ tokenizer = AutoTokenizer.from_pretrained(merging_model_path, padding_side="left")
242
+ tokenizer.pad_token = tokenizer.eos_token
243
+ merging_model = AutoModelForCausalLM.from_pretrained(merging_model_path, torch_dtype=torch.bfloat16, device_map="auto", offload_state_dict=False)
244
+ device = "cuda" if torch.cuda.is_available() else "cpu"
245
+
246
+
247
+ ### Similarity Ranking
248
+ simrank_path = Similarity_Ranking(retrieved_words, simrank_path, SentenceTransformer_model)
249
+ ### Similarity Ranking
250
+
251
+ ### Iteractively Merging
252
+ merging_path, selected_modifiers_path, output_data = Iteractively_Merging(simrank_path, merging_path, selected_modifiers_path, SIMILARITY_THRESHOLD)
253
+ ### Iteractively Merging
254
+
examples/Stage1_RAPO/word_augment.sh ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/bin/bash
2
+
3
+ input_path="./data/test_prompts.txt"
4
+ output_dir="./output/refactor/"
5
+ SIMILARITY_THRESHOLD=0.6
6
+
7
+ output_simrank="${output_dir}/simrank.csv"
8
+ output_selected_modifiers="${output_dir}/selected_modifiers.txt"
9
+ output_interactive_merging="${output_dir}/merging_reuslts.txt"
10
+
11
+ python word_augment.py\
12
+ --retrieved_words "./output/retrieve_words/retrieved_words.txt" \
13
+ --input_path "${input_path}" \
14
+ --output_simrank "${output_simrank}" \
15
+ --output_selected_modifiers "${output_selected_modifiers}" \
16
+ --output_interactive_merging "${output_interactive_merging}" \
17
+ --SIMILARITY_THRESHOLD "${SIMILARITY_THRESHOLD}" \
examples/Stage2_SSPO/examples.csv ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ captions,phys_law
2
+ A swimmer splashing in the sea water.,"Due to momentum transfer to water during strokes and kicks, splashes and waves are generated."
3
+ Pouring milk into still tea.,"Due to density difference and diffusion, milk disperses and mixes with tea."
4
+ Cloth banner hanging from wooden twig.,"Due to gravity balanced by tension, cloth banner reaches static equilibrium."
5
+ Hand shaking salt shaker.,"Due to acceleration overcoming static friction, salt grains flow out."
6
+ Peeler peels an apple.,"Due to shear force exceeding the skin’s strength, thin layers of peel are removed."
7
+ An electric beater whips cream in a bowl.,"Due to rapid mechanical agitation, cream incorporates air and thickens."
8
+ A waterfall cascades over jagged rocks.,"Due to gravitational acceleration, water flows downward and impacts surface creating turbulence."
9
+ A coffee pot pours a morning cup of joe.,"Due to gravity, liquid flows in a stream shaped by surface tension and viscosity."
10
+ Bottle crashes onto concrete floor.,"Due to gravitational fall and brittle fracture on impact, the bottle breaks and energy dissipates as sound and shards."
examples/Stage2_SSPO/phyaware_wan2.1.py ADDED
@@ -0,0 +1,364 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # -*- coding: utf-8 -*-
2
+ import csv
3
+ import torch
4
+ from diffusers import AutoencoderKLWan, WanPipeline
5
+ from diffusers.utils import export_to_video
6
+ from pathlib import Path
7
+ import numpy as np
8
+ import cv2
9
+ import pandas as pd
10
+ from transformers import AutoModelForCausalLM, AutoTokenizer
11
+
12
+ # === VLM dependencies ===
13
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
14
+ from qwen_vl_utils import process_vision_info
15
+
16
+
17
+ # ---------------------------
18
+ # VLM-based alignment assessment
19
+ # ---------------------------
20
+ def misalignment_assessment(
21
+ qwen_vl_path: str,
22
+ video_path: str = "",
23
+ prompt: str = "",
24
+ max_new_tokens: int = 256,
25
+ device: str = "cuda"
26
+ ):
27
+ """
28
+ Use Qwen2.5-VL to assess how well the video aligns with the text description.
29
+ Return the model's response string.
30
+ """
31
+ # Load model and processor
32
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
33
+ qwen_vl_path, torch_dtype="auto", device_map="auto"
34
+ )
35
+ processor = AutoProcessor.from_pretrained(qwen_vl_path)
36
+
37
+ # Evaluation template
38
+ eval_template = f"""
39
+ Evaluate how well the video aligns with the given text prompt.
40
+ Consider whether the objects, actions, and scene described in the prompt are accurately represented in the video.
41
+ Provide a brief explanation and assign an alignment score from 1 (completely misaligned) to 5 (perfectly aligned).
42
+ (A) PROMPT: \"\"\"{prompt}\"\"\"
43
+ """
44
+
45
+ # Build messages
46
+ messages = [
47
+ {
48
+ "role": "user",
49
+ "content": [
50
+ {"type": "video", "video": video_path},
51
+ {"type": "text", "text": eval_template},
52
+ ],
53
+ }
54
+ ]
55
+
56
+ # Prepare inputs
57
+ text = processor.apply_chat_template(
58
+ messages, tokenize=False, add_generation_prompt=True
59
+ )
60
+ image_inputs, video_inputs = process_vision_info(messages)
61
+ inputs = processor(
62
+ text=[text],
63
+ images=image_inputs,
64
+ videos=video_inputs,
65
+ padding=True,
66
+ return_tensors="pt",
67
+ )
68
+ inputs = inputs.to(device)
69
+
70
+ # Generate output
71
+ generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
72
+ generated_ids_trimmed = [
73
+ out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
74
+ ]
75
+ output_text_list = processor.batch_decode(
76
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
77
+ )
78
+
79
+ # Return string (take the first item if it's a list)
80
+ return output_text_list[0] if isinstance(output_text_list, list) and len(output_text_list) > 0 else ""
81
+
82
+
83
+ # ---------------------------
84
+ # Wan pipeline and video generation
85
+ # ---------------------------
86
+ def load_model(model_id: str) -> WanPipeline:
87
+ """
88
+ Load WanPipeline with its VAE.
89
+ """
90
+ vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
91
+ pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
92
+ pipe.to("cuda")
93
+ return pipe
94
+
95
+
96
+ def generate_single_video(
97
+ pipe: WanPipeline,
98
+ prompt: str,
99
+ output_file_path: Path,
100
+ negative_prompt: str,
101
+ seed: int = 1423
102
+ ) -> None:
103
+ """
104
+ Generate a single video and save it to disk.
105
+ """
106
+ generator = torch.Generator(device="cuda").manual_seed(seed)
107
+
108
+ print(f"▶️ Generating video: {prompt}")
109
+
110
+ output = pipe(
111
+ prompt=prompt.strip(),
112
+ negative_prompt=negative_prompt,
113
+ height=480,
114
+ width=832,
115
+ num_frames=81,
116
+ guidance_scale=5.0,
117
+ generator=generator
118
+ ).frames[0]
119
+
120
+ export_to_video(output, str(output_file_path), fps=15)
121
+ print(f"✅ Saved: {output_file_path}")
122
+
123
+
124
+ def extract_optical_flow(video_path: str, sample_interval_sec: float = 0.5) -> list:
125
+ """
126
+ Sample frames from the video and compute mean optical flow between adjacent samples.
127
+ """
128
+ cap = cv2.VideoCapture(video_path)
129
+ if not cap.isOpened():
130
+ raise IOError(f"Cannot open video: {video_path}")
131
+
132
+ frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
133
+ fps = cap.get(cv2.CAP_PROP_FPS)
134
+ sample_every = int(fps * sample_interval_sec) if fps and fps > 0 else 1
135
+
136
+ frames = []
137
+ for i in range(0, frame_count, sample_every):
138
+ cap.set(cv2.CAP_PROP_POS_FRAMES, i)
139
+ ret, frame = cap.read()
140
+ if ret:
141
+ frames.append(frame)
142
+ else:
143
+ break
144
+ cap.release()
145
+
146
+ flows = []
147
+ for i in range(len(frames) - 1):
148
+ prev_gray = cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY)
149
+ next_gray = cv2.cvtColor(frames[i + 1], cv2.COLOR_BGR2GRAY)
150
+ flow = cv2.calcOpticalFlowFarneback(
151
+ prev_gray, next_gray, None,
152
+ pyr_scale=0.5, levels=3, winsize=15,
153
+ iterations=3, poly_n=5, poly_sigma=1.2, flags=0
154
+ )
155
+ mean_flow_x = float(np.mean(flow[..., 0]))
156
+ mean_flow_y = float(np.mean(flow[..., 1]))
157
+ flows.append((mean_flow_x, mean_flow_y))
158
+ return flows
159
+
160
+
161
+ # ---------------------------
162
+ # Physics consistency + VLM alignment fusion + prompt refinement
163
+ # ---------------------------
164
+ def evaluate_physical_consistency(
165
+ flows: list,
166
+ pysical_rule: str,
167
+ text_prompt: str,
168
+ instruct_llm_path: str,
169
+ vlm_alignment: str = ""
170
+ ) -> tuple:
171
+ """
172
+ Physics consistency analysis + VLM alignment assessment fusion + prompt refinement.
173
+ Returns: (mismatch_summary, refined_prompt)
174
+ """
175
+ model = AutoModelForCausalLM.from_pretrained(
176
+ instruct_llm_path,
177
+ torch_dtype="auto",
178
+ device_map="auto"
179
+ )
180
+ tokenizer = AutoTokenizer.from_pretrained(instruct_llm_path)
181
+
182
+ # Phase 1: Physics plausibility check based on optical flow
183
+ physics_check_prompt = (
184
+ "You are an expert in physics and motion analysis. I am providing you with a prompt for generating a video "
185
+ "and the optical-flow motion statistics extracted from that generated video.\n\n"
186
+ f"Prompt for the video: {text_prompt}\n"
187
+ f"Sequence of average optical flow vectors (x, y) per sample: {flows}\n\n"
188
+ "Task: Judge whether the motion is physically plausible, referencing laws such as inertia, conservation of momentum, "
189
+ "buoyancy, and continuous force application. Provide a concise final conclusion only (no process), e.g., "
190
+ "\"Sudden global reversals without external force violate inertia\" or \"No obvious physical inconsistency\".\n"
191
+ "Examples:\n"
192
+ "Response 1: Objects or liquids have sudden reverse motion between adjacent frames; if there is no external force explanation "
193
+ "(such as secondary collision, bounce), this sudden acceleration does not conform to the law of inertia; "
194
+ "in particular, liquids or debris usually do not have overall reverse flow.\n"
195
+ "Response 2: Based on the extracted optical flow, there are no obvious physical inconsistencies in this video. "
196
+ "The motion is smooth, directional, and realistic in magnitude and trend. There are no sudden reversals of direction or unrealistic oscillations."
197
+ )
198
+
199
+ messages = [
200
+ {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
201
+ {"role": "user", "content": physics_check_prompt}
202
+ ]
203
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
204
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
205
+
206
+ generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
207
+ generated_ids = [
208
+ output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
209
+ ]
210
+ physics_response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
211
+
212
+ # Phase 2: Fuse VLM alignment + physics issues -> rewrite prompt
213
+ rewrite_prompt = (
214
+ "You are a prompt engineering expert for diffusion-based text-to-video generation. "
215
+ "Refine the prompt so that the next generated video better matches real-world physics and the intended semantics.\n\n"
216
+ f"Related physical rule to obey: {pysical_rule}\n"
217
+ f"Original prompt: {text_prompt}\n"
218
+ "Detected mismatches:\n"
219
+ f"- Optical-flow-based physics analysis: {physics_response}\n"
220
+ f"- VLM alignment assessment (semantic/temporal/object-action alignment): {vlm_alignment}\n\n"
221
+ "Requirements for the refined prompt:\n"
222
+ "- Describe the expected video content directly; do not mention rules, analysis, or this instruction.\n"
223
+ "- Keep it under 120 words.\n"
224
+ "- Preserve the core intent but explicitly constrain motions, forces, object states, timings, and camera if helpful."
225
+ )
226
+
227
+ rewrite_messages = [
228
+ {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
229
+ {"role": "user", "content": rewrite_prompt}
230
+ ]
231
+ text = tokenizer.apply_chat_template(rewrite_messages, tokenize=False, add_generation_prompt=True)
232
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
233
+ generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
234
+ generated_ids = [
235
+ output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
236
+ ]
237
+ refined_prompt = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
238
+
239
+ # Summarize mismatches for logging
240
+ mismatch_summary = f"[Physics] {physics_response} || [VLM] {vlm_alignment}"
241
+ return mismatch_summary, refined_prompt
242
+
243
+
244
+ # ---------------------------
245
+ # Main workflow: unchanged logic + call VLM assessment
246
+ # ---------------------------
247
+ if __name__ == "__main__":
248
+ # ==== Centralized checkpoint and path configuration ====
249
+ WAN_MODEL_ID = "../../ckpt/Wan2.1-T2V-1.3B-Diffusers" # Wan T2V checkpoint
250
+ INSTRUCT_LLM_PATH = "../../ckpt//Qwen2.5-7B-Instruct" # Instruction-tuned LLM for physics/rewrite
251
+ QWEN_VL_PATH = ../../ckpt//qwen2.5-vl-7B-instruct" # VLM for alignment assessment
252
+
253
+ # Output and data
254
+ OUTPUT_DIR = Path("./results/examples_refined/")
255
+ OUTPUT_LOG = Path("./results/examples_refined/refined_prompts.csv")
256
+ CSV_PATH = Path("examples.csv")
257
+
258
+ # Negative prompt (not a checkpoint path)
259
+ NEGATIVE_PROMPT = (
260
+ "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, "
261
+ "images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, "
262
+ "incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, "
263
+ "misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
264
+ )
265
+
266
+ # Prepare I/O
267
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
268
+ OUTPUT_LOG.parent.mkdir(parents=True, exist_ok=True)
269
+
270
+ # Load T2V pipeline
271
+ pipe = load_model(WAN_MODEL_ID)
272
+
273
+ # Read input CSV
274
+ df = pd.read_csv(CSV_PATH)
275
+
276
+ # Number of refinement iterations per prompt
277
+ num_refine_iterations = 5
278
+
279
+ # === Load existing log (if any) for resume capability ===
280
+ # Log columns: base_name, iter_idx, video_file, mismatch, refined_prompt
281
+ if OUTPUT_LOG.exists() and OUTPUT_LOG.stat().st_size > 0:
282
+ log_df = pd.read_csv(OUTPUT_LOG, header=None, names=["base_name", "iter_idx", "video_file", "mismatch", "refined_prompt"])
283
+ else:
284
+ log_df = pd.DataFrame(columns=["base_name", "iter_idx", "video_file", "mismatch", "refined_prompt"])
285
+
286
+ # Main loop over rows
287
+ for idx, row in df.iterrows():
288
+ base_name = f"{idx + 1:07d}"
289
+ PHYSICAL_RULE = row['phys_law']
290
+ orig_prompt = row['captions']
291
+
292
+ # Recover latest refined prompt if previous iterations exist
293
+ done_rows = log_df[log_df["base_name"] == base_name].sort_values("iter_idx")
294
+ if not done_rows.empty:
295
+ last_iter = int(done_rows["iter_idx"].iloc[-1])
296
+ prompt = str(done_rows["refined_prompt"].iloc[-1])
297
+ start_iter = last_iter + 1
298
+ print(f"\n=== {base_name} has {last_iter} recorded iterations, resuming from {start_iter} ===")
299
+ else:
300
+ prompt = orig_prompt
301
+ start_iter = 1
302
+ print(f"\n=== Processing Row {base_name} ===")
303
+
304
+ for i in range(start_iter, num_refine_iterations + 1):
305
+ video_file = OUTPUT_DIR / f"{base_name}_r{i}.mp4"
306
+
307
+ # Skip if this iteration already exists in the log (ensure the prompt chain stays consistent)
308
+ if not done_rows.empty and i in set(done_rows["iter_idx"].astype(int).tolist()):
309
+ prompt_i = str(done_rows[done_rows["iter_idx"] == i]["refined_prompt"].iloc[0])
310
+ prompt = prompt_i
311
+ print(f"[ Skip ] {base_name} iteration {i} already in log. Skipping.")
312
+ continue
313
+
314
+ # If the video exists but no log entry, evaluate and log directly
315
+ if video_file.exists():
316
+ print(f"[ Found ] Existing video: {video_file}, skipping generation and evaluating directly.")
317
+ try:
318
+ flows = extract_optical_flow(str(video_file))
319
+ vlm_text = misalignment_assessment(
320
+ qwen_vl_path=QWEN_VL_PATH,
321
+ video_path=str(video_file),
322
+ prompt=orig_prompt,
323
+ max_new_tokens=256,
324
+ device="cuda"
325
+ )
326
+ mismatch, refined_prompt = evaluate_physical_consistency(
327
+ flows, PHYSICAL_RULE, orig_prompt, INSTRUCT_LLM_PATH, vlm_alignment=vlm_text
328
+ )
329
+ except Exception as e:
330
+ print(f"[ Warn ] Evaluation failed for existing video: {e}. Skipping this iteration.")
331
+ continue
332
+
333
+ print(f"[ Iter {i} ] Mismatch: {mismatch}")
334
+ print(f"[ Iter {i} ] Refined Prompt: {refined_prompt}")
335
+
336
+ with open(OUTPUT_LOG, mode='a', newline='', encoding='utf-8') as log_file:
337
+ writer = csv.writer(log_file)
338
+ writer.writerow([base_name, i, str(video_file), mismatch, refined_prompt])
339
+
340
+ prompt = refined_prompt # Use for the next iteration
341
+ continue
342
+
343
+ # Normal path: generate -> optical flow -> VLM assess -> fuse -> log
344
+ generate_single_video(pipe, prompt, video_file, NEGATIVE_PROMPT)
345
+ flows = extract_optical_flow(str(video_file))
346
+ vlm_text = misalignment_assessment(
347
+ qwen_vl_path=QWEN_VL_PATH,
348
+ video_path=str(video_file),
349
+ prompt=orig_prompt,
350
+ max_new_tokens=256,
351
+ device="cuda"
352
+ )
353
+ mismatch, refined_prompt = evaluate_physical_consistency(
354
+ flows, PHYSICAL_RULE, orig_prompt, INSTRUCT_LLM_PATH, vlm_alignment=vlm_text
355
+ )
356
+
357
+ print(f"[ Iter {i} ] Mismatch: {mismatch}")
358
+ print(f"[ Iter {i} ] Refined Prompt: {refined_prompt}")
359
+
360
+ with open(OUTPUT_LOG, mode='a', newline='', encoding='utf-8') as log_file:
361
+ writer = csv.writer(log_file)
362
+ writer.writerow([base_name, i, str(video_file), mismatch, refined_prompt])
363
+
364
+ prompt = refined_prompt # Use for the next iteration
requirement.txt ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ networkx==3.4.2
2
+ sentence-transformers==4.1.0
3
+ tqdm==4.64.0
4
+ transformers==4.51.3
5
+ pandas==2.2.3
6
+ protobuf==6.30.2
7
+ accelerate==1.6.0
requirements.txt ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ gradio==5.49.1
2
+ gradio-client==1.13.3
3
+ httpx>=0.24.1,<1.0
4
+ ruff>=0.9.3
5
+ huggingface_hub>=0.20.0
6
+ sentence-transformers>=2.0.0
7
+ sentencepiece==0.2.1
8
+ torch==2.5.1
9
+ torchvision==0.20.1
10
+ torchaudio==2.5.1
11
+ networkx==3.4.2
12
+ tqdm==4.64.0
13
+ transformers==4.51.3
14
+ pandas==2.2.3
15
+ protobuf==6.30.2
16
+ accelerate==1.6.0
17
+ spaces