Spaces: JackIsNotInTheBox/Generate_Audio_for_Video


BoxOfColors committed 5 days ago

Commit 36c1b45 (1 parent: dd33d76)

Update UI descriptions: scene-type guidance instead of duration

Files changed (1)

app.py +19 -14


@@ -573,11 +573,11 @@ with gr.Blocks(title="Video-to-Audio Generation") as demo:
     gr.Markdown(
         "# Video-to-Audio Generation\n"
         "Choose a model and upload a video to generate synchronized audio.\n\n"
-        "| Model | Sample rate | Optimal duration | Notes |\n"
-        "|-------|------------|-----------------|-------|\n"
-        "| **TARO** | 16 kHz | 8.2 s | Video-only, sliding window for longer clips |\n"
-        "| **MMAudio** | 44.1 kHz | 8 s | Text prompt supported |\n"
-        "| **HunyuanFoley** | 48 kHz | up to 15 s | Text-guided foley, highest fidelity |"
     )
     with gr.Tabs():
@@ -587,9 +587,10 @@ with gr.Blocks(title="Video-to-Audio Generation") as demo:
         # ---------------------------------------------------------- #
         with gr.Tab("TARO"):
             gr.Markdown(
-                "**TARO** — Video-conditioned diffusion (ICCV 2025). No text prompt needed. "
-                "8.192 s model window; longer videos are split into overlapping segments "
-                "and stitched with a crossfade."
             )
             with gr.Row():
                 with gr.Column():
@@ -646,9 +647,10 @@ with gr.Blocks(title="Video-to-Audio Generation") as demo:
         with gr.Tab("MMAudio"):
             gr.Markdown(
                 "**MMAudio** — Multimodal flow-matching (CVPR 2025). "
-                "Supports a text prompt for additional control. "
-                "Native window is 8 s at 44.1 kHz. "
-                "Duration slider lets you control how many seconds are processed."
             )
             with gr.Row():
                 with gr.Column():
@@ -698,9 +700,12 @@ with gr.Blocks(title="Video-to-Audio Generation") as demo:
         # ---------------------------------------------------------- #
         with gr.Tab("HunyuanFoley"):
             gr.Markdown(
-                "**HunyuanVideo-Foley** (Tencent Hunyuan). "
-                "Professional-grade text-guided foley at 48 kHz, up to 15 s. "
-                "Requires a text prompt describing the desired sound."
             )
             with gr.Row():
                 with gr.Column():
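
Note on the removed TARO text: it described an 8.192 s model window, with longer videos split into overlapping segments and stitched with a crossfade. The commit does not show how app.py implements this, so the following is only a minimal sketch of the general idea. The function names `plan_segments` and `crossfade`, the 1.0 s default overlap, and the linear fade shape are all assumptions made for illustration.

```python
def plan_segments(total_s, window_s=8.192, overlap_s=1.0):
    """Return (start, end) times of overlapping windows covering [0, total_s].

    window_s comes from the removed UI text; overlap_s is an assumed value.
    """
    if overlap_s < 0 or overlap_s >= window_s:
        raise ValueError("overlap_s must be in [0, window_s)")
    if total_s <= window_s:
        return [(0.0, total_s)]  # short clip: a single window is enough
    hop = window_s - overlap_s
    starts = []
    t = 0.0
    while t + window_s < total_s:
        starts.append(t)
        t += hop
    # Place the final window flush with the end of the clip, so it may
    # overlap its predecessor by more than overlap_s.
    starts.append(total_s - window_s)
    return [(s, s + window_s) for s in starts]


def crossfade(a, b, overlap):
    """Join two sample lists, linearly blending `overlap` samples at the seam."""
    overlap = max(0, min(overlap, len(a), len(b)))
    head = a[:len(a) - overlap]
    tail = b[overlap:]
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises across the seam
        mixed.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + mixed + tail


segments = plan_segments(20.0)
joined = crossfade([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], overlap=1)
```

Because the last window is aligned to the end of the clip, the overlap between the final two segments can differ from the others, so a stitcher built on this sketch would need to compute each seam's overlap from the segment times rather than assume a constant.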

     gr.Markdown(
         "# Video-to-Audio Generation\n"
         "Choose a model and upload a video to generate synchronized audio.\n\n"
+        "| Model | Best for | Avoid for |\n"
+        "|-------|----------|-----------|\n"
+        "| **TARO** | Natural, physics-driven impacts — footsteps, collisions, water, wind, crackling fire. Excels when the sound is tightly coupled to visible motion without needing a text description. | Dialogue, music, or complex layered soundscapes where semantic context matters. |\n"
+        "| **MMAudio** | Mixed scenes where you want both visual grounding *and* semantic control via a text prompt — e.g. a busy street scene where you want to emphasize the rain rather than the traffic. Great for ambient textures and nuanced sound design. | Pure impact/foley shots where TARO's motion-coupling would be sharper, or cinematic music beds. |\n"
+        "| **HunyuanFoley** | Cinematic foley requiring high fidelity and explicit creative direction — dramatic SFX, layered environmental design, or any scene where you have a clear written description of the desired sound palette. | Quick one-shot clips where you don't want to write a prompt, or raw impact sounds where timing precision matters more than richness. |"
     )
     with gr.Tabs():
         # ---------------------------------------------------------- #
         with gr.Tab("TARO"):
             gr.Markdown(
+                "**TARO** — Video-conditioned diffusion (ICCV 2025). No text prompt needed — "
+                "sound is derived entirely from visual motion. "
+                "Best for scenes with clear physics-driven events: footsteps, impacts, splashing water, "
+                "crackling fire, rustling leaves, machinery. The model learns timing directly from the video."
             )
             with gr.Row():
                 with gr.Column():
         with gr.Tab("MMAudio"):
             gr.Markdown(
                 "**MMAudio** — Multimodal flow-matching (CVPR 2025). "
+                "Combines visual grounding with optional text guidance, making it the most flexible choice. "
+                "Best for mixed or ambiguous scenes — busy environments, nature montages, abstract visuals — "
+                "where a short prompt lets you steer which element of the scene to emphasise "
+                "(e.g. *'heavy rain'* over a street scene to suppress traffic noise)."
             )
             with gr.Row():
                 with gr.Column():
         # ---------------------------------------------------------- #
         with gr.Tab("HunyuanFoley"):
             gr.Markdown(
+                "**HunyuanVideo-Foley** (Tencent Hunyuan, 2025). "
+                "Highest-fidelity model for cinematic and creative foley. "
+                "Best for scenes that call for rich, layered sound design — dramatic SFX, "
+                "complex environments (crowd + rain + distant thunder), or any clip where you have "
+                "a clear creative vision you can describe in a prompt. "
+                "Requires a text prompt."
             )
             with gr.Row():
                 with gr.Column():
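
Review note: the new guidance table is written as long hand-concatenated string literals, which makes the cell text hard to edit. One possible alternative, sketched below, builds the Markdown string from row data before passing it to `gr.Markdown`. The helper name `markdown_table` is hypothetical and not part of app.py, and the cell text here is abbreviated from the commit for brevity.

```python
def markdown_table(headers, rows):
    """Build a pipe-delimited Markdown table from a header list and row lists."""
    lines = ["| " + " | ".join(headers) + " |"]
    # Separator row: one run of dashes per column, sized to the header.
    lines.append("|" + "|".join("-" * (len(h) + 2) for h in headers) + "|")
    for row in rows:
        if len(row) != len(headers):
            raise ValueError("row length does not match header length")
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


# Abbreviated placeholder cells; the full wording is in the diff above.
MODEL_GUIDE = [
    ["**TARO**", "Physics-driven impacts tied to visible motion",
     "Dialogue, music, layered soundscapes"],
    ["**MMAudio**", "Mixed scenes steered by a text prompt",
     "Pure impact/foley shots, cinematic music beds"],
    ["**HunyuanFoley**", "Cinematic foley with explicit creative direction",
     "Quick clips without a prompt"],
]

table_md = markdown_table(["Model", "Best for", "Avoid for"], MODEL_GUIDE)
```

The resulting `table_md` string could then be appended to the heading text in the existing `gr.Markdown(...)` call, keeping the row data separate from the layout code.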