MultiAgent-System-for-Screenplay-Creation

Loren1214 committed on Jun 8

Commit

5eeb352

verified ·

1 Parent(s): 90b6539

Update README.md

Files changed (1)

README.md +112 -96

@@ -12,139 +12,155 @@ license: mit
 tag: agent-demo-track
 ---
-# Scriptura: A Multi-Agent System for Screenplay Creation and Editing
-The explanation **video** is available at: https://www.youtube.com/watch?v=I0201ruB1Uo&ab_channel=3DLabFactory.
-The screenplay used in the video as sample is available at: https://www.studiobinder.com/blog/best-free-movie-scripts-online/
 ## Introduction
 **Scriptura** is a multi-agent AI framework based on HF-SmolAgents that streamlines the creation of screenplays, storyboards, and soundtracks by automating the stages of analysis, summarization, and multimodal enrichment—freeing authors to focus on pure creativity.
 At its heart:
-- **Qwen3-32B** serves as the primary orchestrating agent, coordinating workflows and managing high-level reasoning across the system.
-- **Gemma-3-27B-IT** acts as a specialized assistant for multimodal tasks, supporting both text and audio inputs to refine narrative elements and prepare them for downstream generation.
 For media generation, Scriptura integrates:
-- **MusicGen** models (per the AudioCraft MusicGen specification), deployed via Hugging Face Spaces, enabling the agent to produce original soundtracks and sound effects from text prompts or combined text + audio samples.
-- **FLUX (black-forest-labs/FLUX.1-dev)** for on-the-fly image creation, ideal for storyboards, concept art, and visual references that seamlessly tie into the narrative flow.
 Optionally, Scriptura can query external sources (e.g., via a DuckDuckGo API integration) to pull in reference scripts, sound samples, or research materials, ensuring that every draft is not only creatively rich but also contextually informed.
 ## Agent Capabilities
-**Input File Parsing**
-:  - **Formats accepted**: `TXT`, `PDF`, `DOCX`, `JPEG/PNG`, `MP3/WAV`
-   - **Process**: PDF/DOCX → plain text; OCR on images; speech-to-text on audio.
-   - **Why it matters**: Provides structured input for all downstream modules.
-**Overall Plot Summary**
-:  - **Model**: `DeepSeek-R1`
-   - **Output**: 4–6 sentence summary of main narrative threads (timeframe, tone).
-   - **Mechanics**: API calls to DeepSeek with retry logic for improved coherence.
-**Entity & Theme Extraction**
-:  - **Technique**: Named Entity Recognition (via **DeepSeek**)
-   - **Extracts**: Characters, locations, key events, recurring themes, narrative tone.
-   - **Output**: JSON/CSV + ~5-sentence abstract.
-**Rights & Licensing Verification**
-:  - **Web Search ON**: Queries DuckDuckGo API → fetch license info if match.
-   - **Web Search OFF**: May recognize very famous works internally (e.g. “Harry Potter”) but not guaranteed.
-   - **If no match & search OFF**: No licensing check.
-**Image Generation (Storyboard & Concept Art)**
-:  - **Model**: `FLUX (black-forest-labs/FLUX.1-dev)`
-   - **Trigger**: “Generate Image” / storyboard phase.
-   - **Process**: DeepSeek crafts cinematic prompt → FLUX returns PNG/JPEG + caption.
-**Audio Generation (Music & Sound Effects)**
-:  - **Model**: `MusicGen (facebook/musicgen-melody)`
-   - **Trigger**: “Generate Audio.”
-   - **Process**: Send prompt → receive MP3/WAV (standalone audio, no text/images).
-**In-Depth Analysis of Key Points**
-:  - **Extracts**:
-     - Characters (role, gender, description)
-     - Locations (interior/exterior, period, geography)
-     - Plot Points (crucial narrative beats via Story Understanding models)
-   - **Extras**: Semantic toponym extraction → internal scene maps; detect transitions (“Suddenly,” “Meanwhile”).
-**Optional Web Search**
-:  - **Checkbox** toggles DuckDuckGo API lookups.
-   - **If Enabled**: search preconfigured sites (free & paid) for scripts, sound effects.
-   - **Output**: List of links + short summaries.
 ---
 ## Agent Flow
-```mermaid
-flowchart LR
-    A[Start Agent] --> B["Load Input (text, image, audio)"]
-    B --> C[Preprocessing: PDF/DOCX → text, OCR, audio transcription]
-    C --> D["Generate Plot Summary (DeepSeek)"]
-    D --> E["Extract Entities & Themes (DeepSeek)"]
-    E --> F{Web Search Enabled?}
-    F -->|Yes| G[Web Search via DuckDuckGo API]
-    F -->|No| H[Continue Offline Analysis]
-    H --> I[Rights & Licensing Check]
-    I --> J[Deep Analysis: characters, locations, plot points]
-    J --> K{Image Generation Requested?}
-    K -->|Yes| L[API Call to FLUX for storyboard/concept art]
-    K -->|No| M[Skip Image Generation]
-    M --> N{Audio Generation Requested?}
-    N -->|Yes| O[API Call to MusicGen for audio tracks]
-    N -->|No| P[Skip Audio Generation]
-    L & O --> Q[Final Output: text, JSON/CSV, images, audio]
 ```
 ---
-## Deployment & Access and the Code Overview
 ---
 ## Use Cases
 **Independent Writer**
-:  - Upload a screenplay and quickly get a summary, a list of characters, and locations.
-   - Create visual storyboards of key narrative moments via FLUX (PNG/JPEG outputs).
-   - Generate brief soundtracks or sound effects to accompany script presentations (MP3/WAV).
 **Film Production Company**
-:  - Import multiple screenplays (PDF, DOCX) and automatically receive reports on characters, locations, and potential copyright issues.
-   - Use the web search feature to find reference scripts or specific sound effects from free/paid sources.
-   - Develop visual storyboards and audio prototypes to share with directors, artists, and investors.
 **Translation and Adaptation Agency**
-:  - Upload foreign-language scripts and obtain a structured text version with extracted entities (JSON/CSV).
-   - Generate contextual images for cultural adaptation (e.g., images matching the original setting via FLUX).
-   - Produce reference audio via MusicGen to test culturally appropriate music for the target audience.
 **Digital Humanities Course**
-:  - Demonstrate how to build a text-mining tool applied to performing arts, combining NLP, image, and audio pipelines.
-   - Allow students to analyze real scripts, generate abstracts, scene maps, and visual/audio prototypes in a hands-on environment.
-   - Explore Transformer models (DeepSeek), OCR, speech-to-text, and AI-driven media generation as part of the curriculum.
 ---
-## Credits
----
-## Acknowledgements
 ---
-### Contributors:
-- Code development and implementation made by **luke9705**;
-- Ideas creation, testing and videomaking conducted by **OrianIce**;
-- Research and testing by **Loren1214**;
-- Code revisions by **DDPM**.
----
-### Sources
-- Russell, S., & Norvig, P. (2021). *Artificial Intelligence: A Modern Approach* (3rd ed.). Pearson.
-- Cambria, E., & White, B. (2014). *Jumping NLP Curves: A Review of Natural Language Processing Research*. IEEE Computational Intelligence Magazine, 9(2), 48–57.
-- Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., … & Sutskever, I. (2022). *Hierarchical Text-Conditional Image Generation with CLIP Latents*. arXiv preprint arXiv:2204.06125.

 tag: agent-demo-track
 ---
+# Scriptura: A MultiAgent System for Screenplay Creation and Editing
+The explanation video is available [here](https://www.youtube.com/watch?v=I0201ruB1Uo).
+The sample screenplay used in the video is available [here](https://www.studiobinder.com/blog/best-free-movie-scripts-online/).
 ## Introduction
 **Scriptura** is a multi-agent AI framework based on HF-SmolAgents that streamlines the creation of screenplays, storyboards, and soundtracks by automating the stages of analysis, summarization, and multimodal enrichment—freeing authors to focus on pure creativity.
 At its heart:
+* Qwen3-32B serves as the primary orchestrating agent, coordinating workflows and managing high-level reasoning across the system.
+* Gemma-3-27B-IT acts as a specialized assistant for multimodal tasks, supporting both text and audio inputs to refine narrative elements and prepare them for downstream generation.
 For media generation, Scriptura integrates:
+* MusicGen models (per the AudioCraft MusicGen specification), deployed via Hugging Face Spaces, enabling the agent to produce original soundtracks and sound effects from text prompts or combined text + audio samples.
+* FLUX (black-forest-labs/FLUX.1-dev) for on-the-fly image creation, ideal for storyboards, concept art, and visual references that seamlessly tie into the narrative flow.
 Optionally, Scriptura can query external sources (e.g., via a DuckDuckGo API integration) to pull in reference scripts, sound samples, or research materials, ensuring that every draft is not only creatively rich but also contextually informed.
+---
 ## Agent Capabilities
+Scriptura provides a rich set of agents and tools to cover the full screenplay production and enrichment pipeline:
+- **Text Analysis & Summarization**
+  - Automatically extracts key themes, character arcs, and plot points
+  - Segments and summarizes scenes for rapid iteration
+- **Multimodal Ingestion**
+  - Supports PDF, DOCX, ODT, TXT and image uploads
+  - Transcribes audio files using OpenAI Whisper
+- **Image Generation**
+  - On-the-fly storyboard and concept art creation via FLUX (black-forest-labs/FLUX.1-dev)
+- **Audio Generation**
+  - Produces original soundtracks and SFX with MusicGen (AudioCraft spec)
+  - Allows sample-conditioned audio generation
+- **Captioning & Metadata**
+  - Auto-generates captions and descriptions for images using Gemma-3-27B-IT
+- **Optional Web Research**
+  - Queries DuckDuckGo to fetch example scripts, sound samples, or contextual references
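The ingestion capability above amounts to routing each upload by file type before any model is called. A minimal sketch of that routing, assuming extension-based dispatch: the `route_upload` function, the category names, and the image and audio extension sets are illustrative assumptions, not taken from `app.py`.

```python
from pathlib import Path

# Hypothetical mapping from file extension to an ingestion category.
# The document formats come from the capability list; the image and
# audio extensions are assumptions chosen for illustration.
ROUTES = {
    ".pdf": "document", ".docx": "document", ".odt": "document", ".txt": "document",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".mp3": "audio", ".wav": "audio",
}


def route_upload(filename: str) -> str:
    """Return the ingestion category for an uploaded file, or raise ValueError."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match on the extension
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename!r}") from None
```

Under these assumptions, a `document` result would go to text extraction, an `image` result to captioning, and an `audio` result to transcription; rejecting unknown extensions early keeps unsupported files away from the agent.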
 ---
 ## Agent Flow
+Here’s an example flow demonstrating how you could use the agent.
+<img alt="Flowchart" src="https://www.canva.com/design/DAGphLlng2I/MZ2cOAnS520rFtnhTP5H6A/view?utm_content=DAGphLlng2I&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=hca1222039d" width="600"/>
+![img.png](img.png)
+---
+## Code Overview
+```bash
+.
+├── app.py               # Entry point: defines Gradio interface and routing logic
+├── system_prompt.txt    # System-level prompt template for the CodeAgent
+├── requirements.txt     # Python dependencies (Gradio, SmolAgents, OpenAI, etc.)
+└── README.md            # Project documentation
 ```
+* **app.py**
+  * **Agent** class: loads Qwen3-32B model, registers all tools
+  * **respond()**: orchestrates between Gradio inputs and CodeAgent
+  * Decorated `@tool` functions for image download, media generation, transcription, captioning
+  * Gradio `ChatInterface` setup with text/file support and “Enable web search” toggle
+* **system\_prompt.txt**
+  * Injects the agent’s “way of thinking,” including reasoning structure and error handling
+* **requirements.txt**
+  * Lists all required libraries (Gradio, SmolAgents, OpenAI, HuggingFace, PDFPlumber, etc.)
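How `respond()` might combine the chat message, uploaded files, and the web-search toggle can be sketched without any framework code. This is an illustration only: `build_request`, the tool names, and the message shape (a dict with `text` and `files` keys, which is how a multimodal Gradio chat input is assumed to arrive) are hypothetical and do not reproduce `app.py`.

```python
# Hypothetical tool names standing in for the @tool functions listed above.
BASE_TOOLS = ["generate_image", "generate_audio", "transcribe_audio", "caption_image"]
SEARCH_TOOL = "web_search"  # only offered when the toggle is enabled


def build_request(message: dict, web_search: bool = False) -> dict:
    """Turn a chat message and the search toggle into a prompt plus a tool list."""
    text = message.get("text", "").strip()
    files = list(message.get("files", []))

    # The search tool is appended only when the user ticks the checkbox.
    tools = BASE_TOOLS + ([SEARCH_TOOL] if web_search else [])

    # Mention attached files in the prompt so the agent knows what to process.
    prompt = text
    if files:
        prompt += "\n\nAttached files:\n" + "\n".join(f"- {path}" for path in files)

    return {"prompt": prompt, "tools": tools}
```

Keeping this step as a pure function makes the toggle behavior easy to check in isolation before the prompt and tools are handed to the agent.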
 ---
+## Deployment & Access
+### Hugging Face Spaces
+1. Include `app.py`, `system_prompt.txt`, and `requirements.txt` in the root of your Space.
+2. Configure `OPENAI_API_KEY` and `HF_TOKEN` as Secrets in your Space’s settings.
+3. Make sure the Space is set to use **Python 3.10 or higher**.
+4. Select **Gradio** as the SDK (version 5.32.1).
+5. Pin or share the Space link to collaborate with your team.
+> **Note:** If you choose to clone this repository and run it locally, make sure to set your own `OPENAI_API_KEY` and `HF_TOKEN` environment variables before launching.
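For a local run, a small pre-flight check can confirm both variables are set before the app starts. The sketch below is an assumption about how such a check could look; `missing_secrets` is a hypothetical helper, and only the two variable names come from the steps above.

```python
import os

# The two secrets named in the deployment steps.
REQUIRED_SECRETS = ("OPENAI_API_KEY", "HF_TOKEN")


def missing_secrets(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]
```

Calling `missing_secrets()` with no arguments inspects the real environment; a launcher script could abort with a clear message when the returned list is non-empty instead of failing later on the first API call.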
 ---
 ## Use Cases
 **Independent Writer**
+* Upload a screenplay and quickly get a summary, a list of characters, and locations.
+* Create visual storyboards of key narrative moments via FLUX (PNG/JPEG outputs).
+* Generate brief soundtracks or sound effects to accompany script presentations (MP3/WAV).
 **Film Production Company**
+* Import multiple screenplays (PDF, DOCX) and automatically receive reports on characters, locations, and potential copyright issues.
+* Use the web search feature to find reference scripts or specific sound effects from free/paid sources.
+* Develop visual storyboards and audio prototypes to share with directors, artists, and investors.
 **Translation and Adaptation Agency**
+* Upload foreign-language scripts and obtain a structured text version with extracted entities (JSON/CSV).
+* Generate contextual images for cultural adaptation (e.g., images matching the original setting via FLUX).
+* Produce reference audio via MusicGen to test culturally appropriate music for the target audience.
 **Digital Humanities Course**
+* Demonstrate how to build a text-mining tool applied to performing arts, combining NLP, image, and audio pipelines.
+* Allow students to analyze real scripts, generate abstracts, scene maps, and visual/audio prototypes in a hands-on environment.
+* Explore Transformer models (DeepSeek), OCR, speech-to-text, and AI-driven media generation as part of the curriculum.
 ---
+## Contributors
+* Code development and implementation by luke9705
+* Ideation, testing, and video production by OrianIce
+* Research and testing by Loren1214
+* Code revisions by DDPM
 ---
+## Sources
+The following libraries, models, and tools power Scriptura’s agents and multimodal capabilities:
+- **Qwen3-32B** – primary orchestrating LLM for high-level reasoning and workflow management
+- **Gradio** – interactive web UI framework
+- **smolagents** – lightweight multi-agent orchestrator from Hugging Face
+- **huggingface_hub** – model & dataset management
+- **duckduckgo-search** – optional web research integration
+- **openai** – Whisper transcription, GPT-based reasoning
+- **anthropic** – Claude-style LLM support
+- **pdfplumber** – PDF text extraction
+- **docx2txt** – DOCX parsing
+- **odfpy** – ODT parsing
+- **pandas** – data handling
+- **Pillow (PIL)** – image processing
+- **requests** – HTTP client for external APIs
+- **numpy** – numerical operations
+- **MusicGen (AudioCraft)** – soundtrack and SFX generation
+- **FLUX (black-forest-labs/FLUX.1-dev)** – on-the-fly image generation
+- **Gemma-3-27B-IT** – multimodal captioning and metadata