---
title: Scriptura
emoji: 🏆
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
tags:
  - agent-demo-track
---

# Scriptura: A Multi-Agent System for Screenplay Creation and Editing

A walkthrough **video** is available at: https://www.youtube.com/watch?v=I0201ruB1Uo&ab_channel=3DLabFactory

The sample screenplay used in the video is available at: https://www.studiobinder.com/blog/best-free-movie-scripts-online/

## Introduction

**Scriptura** is a multi-agent AI framework built on Hugging Face smolagents. It streamlines the creation of screenplays, storyboards, and soundtracks by automating analysis, summarization, and multimodal enrichment, so that authors can focus on the creative work.

At its heart:
- **Qwen3-32B** serves as the primary orchestrating agent, coordinating workflows and managing high-level reasoning across the system.
- **Gemma-3-27B-IT** acts as a specialized assistant for multimodal tasks, supporting both text and audio inputs to refine narrative elements and prepare them for downstream generation.

For media generation, Scriptura integrates:
- **MusicGen** models (per the AudioCraft MusicGen specification), deployed via Hugging Face Spaces, enabling the agent to produce original soundtracks and sound effects from text prompts or combined text + audio samples.
- **FLUX (black-forest-labs/FLUX.1-dev)** for on-the-fly image creation, suited to storyboards, concept art, and visual references that tie into the narrative flow.

Optionally, Scriptura can query external sources (e.g., via a DuckDuckGo API integration) to pull in reference scripts, sound samples, or research materials, so that drafts can draw on relevant context.

## Agent Capabilities

**Input File Parsing**  
:  - **Formats accepted**: `TXT`, `PDF`, `DOCX`, `JPEG/PNG`, `MP3/WAV`  
   - **Process**: PDF/DOCX → plain text; OCR on images; speech-to-text on audio.  
   - **Why it matters**: Provides structured input for all downstream modules.
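
The routing described above can be sketched as a lookup from file extension to preprocessing step. This is a minimal illustration, not Scriptura's actual code: `ROUTES`, `route_input`, and the step names are hypothetical, and the real text extraction, OCR, and speech-to-text calls are left out.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a preprocessing step name.
ROUTES = {
    ".txt": "read_text",
    ".pdf": "extract_text",
    ".docx": "extract_text",
    ".jpg": "ocr",
    ".jpeg": "ocr",
    ".png": "ocr",
    ".mp3": "speech_to_text",
    ".wav": "speech_to_text",
}


def route_input(filename: str) -> str:
    """Return the preprocessing step for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ROUTES:
        raise ValueError(f"Unsupported input format: {suffix or '(no extension)'}")
    return ROUTES[suffix]
```

Matching on the lowercased suffix keeps `SCRIPT.PDF` and `script.pdf` on the same path, and rejecting unknown formats early avoids passing unreadable input to later modules.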

**Overall Plot Summary**  
:  - **Model**: `DeepSeek-R1`  
   - **Output**: 4–6 sentence summary of main narrative threads (timeframe, tone).  
   - **Mechanics**: API calls to DeepSeek with retry logic for improved coherence.
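
One way to read "retry logic" here is to call the model again when a request fails or when the summary falls outside the 4–6 sentence target. The sketch below assumes that reading. `generate` is a placeholder for whatever function calls DeepSeek, and `count_sentences` and `summarize_with_retry` are hypothetical names, not part of Scriptura.

```python
import re
import time
from typing import Callable


def count_sentences(text: str) -> int:
    """Rough sentence count: split on '.', '!' or '?' followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def summarize_with_retry(
    generate: Callable[[], str],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `generate` until it returns a 4-6 sentence summary or attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            summary = generate()
            if 4 <= count_sentences(summary) <= 6:
                return summary
            last_error = ValueError("summary is not 4-6 sentences long")
        except Exception as exc:  # broad on purpose in this sketch; narrow it in real code
            last_error = exc
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError(f"No acceptable summary after {attempts} attempts") from last_error
```

The sentence counter is deliberately crude (abbreviations such as "Dr." will inflate the count), so treat the length check as a sanity filter, not a quality measure.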

**Entity & Theme Extraction**  
:  - **Technique**: Named Entity Recognition (via **DeepSeek**)  
   - **Extracts**: Characters, locations, key events, recurring themes, narrative tone.  
   - **Output**: JSON/CSV + ~5-sentence abstract.
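
The JSON/CSV output could take a shape like the one below. The `Entity` fields (`kind`, `name`, `note`) are an assumed schema for illustration only; the source does not specify the actual columns.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class Entity:
    kind: str       # e.g. "character", "location", "event", "theme"
    name: str
    note: str = ""  # optional free-text description


def to_json(entities: list[Entity]) -> str:
    """Serialize entities as a JSON array of objects."""
    return json.dumps([asdict(e) for e in entities], ensure_ascii=False, indent=2)


def to_csv(entities: list[Entity]) -> str:
    """Serialize entities as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["kind", "name", "note"], lineterminator="\n")
    writer.writeheader()
    for entity in entities:
        writer.writerow(asdict(entity))
    return buffer.getvalue()
```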

**Rights & Licensing Verification**  
:  - **Web Search ON**: Queries the DuckDuckGo API and fetches license information if a match is found.  
   - **Web Search OFF**: The model may recognize very famous works on its own (e.g. “Harry Potter”), but this is not guaranteed.  
   - **If no match and search is OFF**: No licensing check is performed.
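
The three cases above amount to a small decision function, sketched below. `search` and `recognize` are placeholders for the web lookup and the model's own recall, and the status strings are hypothetical. The `"no_match"` result for an enabled search that finds nothing is an assumption, since the text does not cover that case.

```python
from typing import Callable, Optional, Tuple


def check_rights(
    title: str,
    web_search_enabled: bool,
    search: Callable[[str], Optional[str]],
    recognize: Callable[[str], bool],
) -> Tuple[str, Optional[str]]:
    """Return (status, license_info) for a title, following the three documented cases."""
    if web_search_enabled:
        info = search(title)
        if info is not None:
            return ("web_match", info)
        return ("no_match", None)  # assumption: behavior here is not documented
    if recognize(title):
        # The model happened to know the work; no license details are fetched.
        return ("recognized_offline", None)
    return ("not_checked", None)
```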

**Image Generation (Storyboard & Concept Art)**  
:  - **Model**: `FLUX (black-forest-labs/FLUX.1-dev)`  
   - **Trigger**: “Generate Image” / storyboard phase.  
   - **Process**: DeepSeek crafts cinematic prompt → FLUX returns PNG/JPEG + caption.
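
As a rough sketch of the prompt-to-image step, the code below assumes FLUX is reached through `huggingface_hub.InferenceClient.text_to_image`; the source does not say how the app actually calls the model. `build_image_prompt` is a plain template standing in for the DeepSeek prompt-writing step, and both function names are hypothetical. Running `generate_storyboard_frame` needs network access, a Hugging Face token, and access to the model.

```python
from typing import Optional


def build_image_prompt(scene: str, style: str = "cinematic storyboard frame") -> str:
    """Combine a style tag with a scene description, collapsing stray whitespace."""
    scene = " ".join(scene.split())
    return f"{style}, {scene}"


def generate_storyboard_frame(scene: str, out_path: str, token: Optional[str] = None) -> str:
    """Generate one image for a scene and save it to out_path (requires network access)."""
    # Imported lazily so the prompt helper works without huggingface_hub installed.
    from huggingface_hub import InferenceClient

    client = InferenceClient(token=token)
    image = client.text_to_image(
        build_image_prompt(scene),
        model="black-forest-labs/FLUX.1-dev",
    )
    image.save(out_path)  # text_to_image returns a PIL image
    return out_path
```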

**Audio Generation (Music & Sound Effects)**  
:  - **Model**: `MusicGen (facebook/musicgen-melody)`  
   - **Trigger**: “Generate Audio.”  
   - **Process**: Send prompt → receive MP3/WAV (standalone audio, no text/images).

**In-Depth Analysis of Key Points**  
:  - **Extracts**:  
     - Characters (role, gender, description)  
     - Locations (interior/exterior, period, geography)  
     - Plot Points (crucial narrative beats via Story Understanding models)  
   - **Extras**: Semantic toponym extraction → internal scene maps; detect transitions (“Suddenly,” “Meanwhile”).
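
Transition detection can be illustrated with a simple keyword scan. This is a sketch only: the cue list contains just the two words named above, and `find_transitions` is a hypothetical helper that does not reflect how Scriptura implements the feature.

```python
import re

# Cue words taken from the examples above; extend as needed.
TRANSITIONS = ("suddenly", "meanwhile")

_PATTERN = re.compile(r"\b(" + "|".join(TRANSITIONS) + r")\b", re.IGNORECASE)


def find_transitions(text: str) -> list[tuple[int, str]]:
    """Return (line_number, cue) pairs for every transition cue found in the text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((number, match.group(1).lower()))
    return hits
```

For example, scanning a scene whose second line starts with "Suddenly," and whose third starts with "Meanwhile," yields `[(2, "suddenly"), (3, "meanwhile")]`.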

**Optional Web Search**  
:  - **Checkbox** toggles DuckDuckGo API lookups.  
   - **If Enabled**: Searches preconfigured sites (free and paid) for scripts and sound effects.  
   - **Output**: List of links + short summaries.


---

## Agent Flow

```mermaid
flowchart LR
    A[Start Agent] --> B["Load Input (text, image, audio)"]
    B --> C["Preprocessing: PDF/DOCX → text, OCR, audio transcription"]
    C --> D["Generate Plot Summary (DeepSeek)"]
    D --> E["Extract Entities & Themes (DeepSeek)"]
    E --> F{Web Search Enabled?}
    F -->|Yes| G[Web Search via DuckDuckGo API]
    F -->|No| H[Continue Offline Analysis]
    G --> I["Rights & Licensing Check"]
    H --> I
    I --> J["Deep Analysis: characters, locations, plot points"]
    J --> K{Image Generation Requested?}
    K -->|Yes| L["API Call to FLUX for storyboard/concept art"]
    K -->|No| M[Skip Image Generation]
    L --> N{Audio Generation Requested?}
    M --> N
    N -->|Yes| O[API Call to MusicGen for audio tracks]
    N -->|No| P[Skip Audio Generation]
    O --> Q["Final Output: text, JSON/CSV, images, audio"]
    P --> Q
```
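
The same flow can be written as a short planning function. This sketch follows one reading of the diagram, in which both web-search branches lead to the rights check and the image and audio steps are independent options. `Options`, `plan_steps`, and the step names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Options:
    web_search: bool = False
    images: bool = False
    audio: bool = False


def plan_steps(opts: Options) -> list[str]:
    """Return the ordered step names the agent would run for the given options."""
    steps = ["load_input", "preprocess", "plot_summary", "extract_entities"]
    steps.append("web_search" if opts.web_search else "offline_analysis")
    steps += ["rights_check", "deep_analysis"]
    if opts.images:
        steps.append("generate_images")  # FLUX
    if opts.audio:
        steps.append("generate_audio")   # MusicGen
    steps.append("final_output")
    return steps
```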
---
## Deployment, Access, and Code Overview

---
## Use Cases

**Independent Writer**  
:  - Upload a screenplay and quickly get a summary, a list of characters, and locations.  
   - Create visual storyboards of key narrative moments via FLUX (PNG/JPEG outputs).  
   - Generate brief soundtracks or sound effects to accompany script presentations (MP3/WAV).

**Film Production Company**  
:  - Import multiple screenplays (PDF, DOCX) and automatically receive reports on characters, locations, and potential copyright issues.  
   - Use the web search feature to find reference scripts or specific sound effects from free/paid sources.  
   - Develop visual storyboards and audio prototypes to share with directors, artists, and investors.

**Translation and Adaptation Agency**  
:  - Upload foreign-language scripts and obtain a structured text version with extracted entities (JSON/CSV).  
   - Generate contextual images for cultural adaptation (e.g., images matching the original setting via FLUX).  
   - Produce reference audio via MusicGen to test culturally appropriate music for the target audience.

**Digital Humanities Course**  
:  - Demonstrate how to build a text-mining tool applied to performing arts, combining NLP, image, and audio pipelines.  
   - Allow students to analyze real scripts, generate abstracts, scene maps, and visual/audio prototypes in a hands-on environment.  
   - Explore Transformer models (DeepSeek), OCR, speech-to-text, and AI-driven media generation as part of the curriculum.

---
## Credits


---
## Acknowledgements


---
### Contributors
- Code development and implementation by **luke9705**
- Ideas, testing, and video production by **OrianIce**
- Research and testing by **Loren1214**
- Code revisions by **DDPM**
