Spaces: Anson818/CCAI_TEST

Anson818 committed on Nov 11, 2025

Commit 9074cec (verified) · 1 Parent(s): 918256e

Update app.py

Files changed (1)

app.py +60 -5

@@ -36,12 +36,60 @@ except Exception as e:
 # This is the long prompt from your script
 prompt1 = """Role:
  You are an expert computer vision analyst that specializes in converting videos into precise, exhaustive, and purely visual scene descriptions.
-... (Your full Gemini prompt) ...
 Final Output Rule:
- Produce a single, continuous, structured description following all the above rules.
  Do not summarize, infer meaning, or include audio elements.
  The output must be factual, visual, chronological, and exhaustive."""
 # --- 3. The Main Workflow Function for Gradio ---
 def generate_sfx(video_path):
     """
@@ -74,9 +122,16 @@ def generate_sfx(video_path):
     # --- Step 2: Llama Prompt Generation ---
     try:
-        your_prompt = f"""Identify the suitable audio effects based on the given video transcript...
-        ... (Your full Llama prompt) ...
-        Transcript: {transcript}"""
         completion = llama_client.chat.completions.create(
           model="meta/llama-3.1-405b-instruct",

 # This is the long prompt from your script
 prompt1 = """Role:
  You are an expert computer vision analyst that specializes in converting videos into precise, exhaustive, and purely visual scene descriptions.
+Primary Objective:
+Analyze the provided video and generate a detailed, chronological description of everything visually occurring in the footage. Focus entirely on what can be seen, not heard.
+Core Instructions:
+Follow these instructions exactly:
+Visual-Only Focus
+Describe only what is visible on-screen.
+Ignore all sounds, dialogue, narration, or music.
+Include on-screen text only if it appears as a visible object (e.g., sign, label, subtitle).
+Chronological Detailing
+Describe events strictly in the order they appear.
+Use clear temporal markers such as “At the beginning…”, “Next…”, “Then…”, “After that…”, “Finally…”
+Comprehensive Visual Content
+Describe people, objects, settings, environments, lighting, colors, positions, and movements.
+Include camera actions (pans, tilts, zooms, cuts, transitions).
+Capture facial expressions, gestures, and body posture changes if visible.
+Objectivity and Precision
+Avoid interpretation, emotion, or speculation.
+Describe only observable facts (e.g., say “The person raises their right arm,” not “The person waves hello”).
+Level of Detail
+Provide enough visual information for someone to recreate or storyboard the entire scene.
+Include every key visual or motion change.
+Output Formatting:
+Use the following structured format:
+[Timestamp or Sequence Indicator]
+Detailed description of what is visually happening.
+Example:
+0:00–0:04 — A man in a dark blue jacket walks across a street. A red car passes behind him.
+0:05–0:09 — The camera tilts upward to show a tall building with glass windows. The sky is cloudy.
+0:10–0:13 — The man stops, looks up, and adjusts the strap of a black backpack.
+If timestamps are unavailable, use sequence-based ordering (e.g., “Scene 1,” “Scene 2,” etc.).
 Final Output Rule:
+Produce a single, continuous, structured description following all the above rules.
  Do not summarize, infer meaning, or include audio elements.
  The output must be factual, visual, chronological, and exhaustive."""
 # --- 3. The Main Workflow Function for Gradio ---
 def generate_sfx(video_path):
     """
     # --- Step 2: Llama Prompt Generation ---
     try:
+        your_prompt = f"""Identify the suitable audio effects based on the given video transcript and
+    generate a suitable and detailed prompt for each audio effects for another audio generating AI
+    model to generate the audio effects. Note that the duration of each audio should be within 2-10
+    seconds. Only include the prompts for generating the sound effects
+    and do not include any other text, such as timestamps. Separate the prompt and the duration for
+    each audio effects with a new line. Output in the following format for each prompt and duration:
+    [prompt1];[duration1] (new line) [prompt2];[duration2] etc. only include the number of the duration
+    in [duration] No other text should be included in
+    the output. Do make the prompts with details, such as the intensity, feeling etc according to the
+    video transcript so that the high quality and suitable sound can be generated. Transcript: {transcript}"""
         completion = llama_client.chat.completions.create(
           model="meta/llama-3.1-405b-instruct",
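
The Llama prompt added in this commit asks for one `[prompt];[duration]` pair per line, with each duration between 2 and 10 seconds. The hunk does not show how the reply is consumed, so the following is only a minimal sketch of how such a reply could be parsed. The helper name `parse_sfx_lines`, the bracket stripping, and the clamping to the 2-10 second range are assumptions for illustration and are not part of app.py.

```python
from typing import List, Tuple

# Bounds taken from the prompt text ("within 2-10 seconds").
MIN_SECONDS = 2.0
MAX_SECONDS = 10.0


def parse_sfx_lines(llama_output: str) -> List[Tuple[str, float]]:
    """Hypothetical helper: turn 'prompt;duration' lines into (prompt, seconds) pairs."""
    results: List[Tuple[str, float]] = []
    for raw_line in llama_output.splitlines():
        line = raw_line.strip()
        if not line or ";" not in line:
            continue  # skip blank lines and lines without a separator

        # Split on the LAST semicolon so prompts may themselves contain ';'.
        prompt, _, duration_text = line.rpartition(";")

        # The model may echo the literal brackets from the format description.
        prompt = prompt.strip().strip("[]").strip()
        duration_text = duration_text.strip().strip("[]").strip()

        try:
            duration = float(duration_text)
        except ValueError:
            continue  # duration was not a plain number; drop the line
        if not prompt:
            continue

        # Assumption: clamp out-of-range durations rather than reject them.
        duration = max(MIN_SECONDS, min(MAX_SECONDS, duration))
        results.append((prompt, duration))
    return results


sample = (
    "Heavy rain hitting a metal roof, steady and intense;6\n"
    "[Car door slams shut];[1]\n"
    "not a valid line\n"
    "Footsteps; on gravel;12\n"
)
parsed = parse_sfx_lines(sample)
```

Splitting on the last semicolon keeps a prompt intact when the model uses semicolons inside the description, and silently dropping malformed lines avoids passing free text to the audio model as if it were a prompt.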