---
title: DocuMaker
emoji: 🎬
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.18.0
app_file: app.py
python_version: "3.11"
pinned: false
license: mit
short_description: Turn a tutorial video into a step-by-step DOCX guide
---

# DocuMaker

Turn a tutorial/screencast **video** into a polished **step-by-step `.docx` guide**
with screenshots — using open-source tools and **free** HuggingFace models.

> The block above is HuggingFace Spaces config. On GitHub it renders as a small
> table; on Spaces it tells the platform how to run the app.

Pipeline: `video → preview → frames (manual + automatic) → audio transcription
(Whisper) → LLM cleanup & step-structuring → image/step alignment + captions →
DOCX export`.

- **UI:** a local Gradio app.
- **Transcription:** runs **locally** with [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
  (GPU with automatic CPU fallback).
- **Guide writing:** the **HuggingFace Inference API** (uses your HF token).
- **Image captions:** a vision model. It tries an API vision-chat model first, then
  falls back to a **local BLIP** captioner — which is the path most free HF accounts
  end up using, since few enabled providers serve a vision model on the free tier.
- **Frames:** one-click manual snapshots **and** automatic scene-detection
  (PySceneDetect) de-duplicated with perceptual hashing.
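
To illustrate the de-duplication mentioned in the last bullet, here is a minimal sketch of perceptual hashing using an average hash. It is not necessarily the hash DocuMaker uses. It assumes each frame has already been shrunk to a small grayscale grid, and the names `average_hash`, `hamming`, `dedupe`, and `max_distance` are hypothetical.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[int]]  # small grayscale thumbnail, values 0-255


def average_hash(thumb: Gray) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    pixels = [p for row in thumb for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def dedupe(thumbs: List[Gray], max_distance: int = 2) -> List[int]:
    """Return indices of frames that are not near-duplicates of a kept frame."""
    kept: List[int] = []
    hashes: List[int] = []
    for i, thumb in enumerate(thumbs):
        h = average_hash(thumb)
        # Keep the frame only if it is far from every hash kept so far.
        if all(hamming(h, other) > max_distance for other in hashes):
            kept.append(i)
            hashes.append(h)
    return kept
```

Small brightness changes leave the hash unchanged, so two nearly identical scene-detection frames collapse into one. `max_distance` trades recall for strictness and would need tuning against the thumbnail size.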

## How a frame is chosen for each step

Relevance is decided by combining several signals (no single model "judges" it):

1. **Timestamp alignment** — the frame on screen while that step was narrated
   (Whisper timestamps ↔ step time). The strongest signal for tutorials.
2. **Sharpness** — variance-of-Laplacian, to avoid blurry scene-transition frames.
3. **BLIP caption match** — the BLIP caption is compared to the step's text; a frame
   whose description overlaps the step gets a nudge. This *suggests*, it doesn't decide.
4. **Manual preference** — frames you snapshot yourself win ties.

See `_pick_frame` in [src/guide.py](src/guide.py).
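
The four signals above could be combined roughly as follows. This is a sketch under stated assumptions, not the actual `_pick_frame`: the `Frame` and `Step` types, the 0.6 / 0.25 / 0.15 weights, and the word-overlap measure are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Frame:
    timestamp: float      # seconds into the video
    sharpness: float      # e.g. variance of the Laplacian; higher is sharper
    caption: str = ""     # BLIP caption; empty when captioning is off
    manual: bool = False  # True for frames the user captured by hand


@dataclass
class Step:
    start: float
    end: float
    text: str


def _words(text: str) -> Set[str]:
    """Lower-cased words longer than two characters, punctuation stripped."""
    cleaned = (w.strip(".,:;!?\"'()").lower() for w in text.split())
    return {w for w in cleaned if len(w) > 2}


def score(frame: Frame, step: Step, max_sharpness: float) -> float:
    # 1. Timestamp alignment: 1.0 inside the step, decaying with distance.
    if step.start <= frame.timestamp <= step.end:
        gap = 0.0
    else:
        gap = min(abs(frame.timestamp - step.start),
                  abs(frame.timestamp - step.end))
    time_score = 1.0 / (1.0 + gap)
    # 2. Sharpness, normalised against the sharpest candidate.
    sharp_score = frame.sharpness / max_sharpness if max_sharpness > 0 else 0.0
    # 3. Caption match: Jaccard overlap between caption and step words.
    a, b = _words(frame.caption), _words(step.text)
    caption_score = len(a & b) / len(a | b) if a and b else 0.0
    # Hypothetical weights: timestamp dominates, the caption only nudges.
    return 0.6 * time_score + 0.25 * sharp_score + 0.15 * caption_score


def pick_frame_sketch(frames: List[Frame], step: Step) -> Optional[Frame]:
    if not frames:
        return None
    max_sharpness = max(f.sharpness for f in frames)
    # 4. Manual preference only breaks ties between equal scores.
    return max(
        frames,
        key=lambda f: (round(score(f, step, max_sharpness), 6), f.manual),
    )
```

Putting `manual` second in the sort key means a hand-captured frame wins only when the scores are equal, which matches the "win ties" wording above.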

## Requirements

- Python 3.11
- **ffmpeg** on your `PATH` (`ffmpeg -version` should work)
- A HuggingFace token — get one at <https://huggingface.co/settings/tokens>
  (a free **Read** token works). You **paste it into the app's UI**; it is not read
  from the environment. (The headless smoke test reads it from `HF_TOKEN` /
  `HUGGINGFACEHUB_API_TOKEN` instead.)
- A CUDA GPU is optional (Whisper falls back to CPU automatically)

## Setup

```bash
python -m venv .venv
# macOS / Linux:         source .venv/bin/activate
# Windows (PowerShell):  .venv\Scripts\Activate.ps1
# Git Bash:              source .venv/Scripts/activate
pip install -r requirements.txt

cp .env.example .env   # optional — tweak model ids / Whisper size
```

`requirements.txt` includes the local **BLIP** captioner (`torch` + `transformers`,
CPU build). The BLIP model (~1 GB) downloads on first use and is cached. Captions
are nice-to-have: without a captioner you still get images + step text, and frame
selection uses timestamp + sharpness only.

To run BLIP on a local **NVIDIA GPU**, reinstall a CUDA build of torch — it's used
automatically when available:

```bash
pip uninstall -y torch   # remove the CPU build first, or pip reports "already satisfied"
pip install torch --index-url https://download.pytorch.org/whl/cu124
```

### HuggingFace token

The **app takes your token from the UI**: paste it into the "🔑 HuggingFace token"
box at the top of the page. In local single-user mode, the app then sets it as the
process `HF_TOKEN` (in memory only, never written to disk). Every HuggingFace
operation (the guide LLM and model downloads) authenticates with it, and it
overrides any stale token already in your environment.

**Shared / multi-user deployments:** if you launch with `DOCUMAKER_SHARE=1` or a
non-localhost `DOCUMAKER_SERVER_NAME`, the app switches to per-session tokens and
**does not** touch the global `HF_TOKEN` — so one user's token can never leak to
another. The token is still threaded directly to that user's LLM/caption calls (the
guide LLM clients are created per request, never cached across sessions).
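
The difference between the two modes can be sketched as below. This is an illustration of the behaviour described above, not the app's actual code: `is_multi_user`, `apply_token`, and the `LOCAL_HOSTS` set are hypothetical names, and the exact list of addresses treated as localhost is an assumption.

```python
import os
from typing import Mapping, MutableMapping, Optional

# Assumption: these addresses count as "localhost" for mode detection.
LOCAL_HOSTS = {"127.0.0.1", "localhost", "::1"}


def is_multi_user(env: Mapping[str, str]) -> bool:
    """True when the app is shared publicly or bound to a non-local address."""
    if env.get("DOCUMAKER_SHARE", "0") == "1":
        return True
    return env.get("DOCUMAKER_SERVER_NAME", "127.0.0.1") not in LOCAL_HOSTS


def apply_token(token: str, env: MutableMapping[str, str]) -> Optional[str]:
    """Return the token to pass to each request, or None if it is blank.

    Only local single-user mode writes the token into the process
    environment; shared mode leaves the global HF_TOKEN untouched.
    """
    token = token.strip()
    if not token:
        return None
    if not is_multi_user(env):
        env["HF_TOKEN"] = token  # in memory only, never written to disk
    return token


if __name__ == "__main__":
    # "hf_example" is a placeholder, not a real token.
    per_request_token = apply_token("hf_example", os.environ)
```

In both modes the returned value is what gets handed to that session's LLM and caption calls, so shared deployments never depend on process-wide state.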

The **headless smoke test** (`scripts/smoke_test.py`) instead reads the token from
`HF_TOKEN` / `HUGGINGFACEHUB_API_TOKEN` and validates which one actually
authenticates (handy if one is stale).

## Run

```bash
python app.py
```

Open the printed local URL, then:

1. **Paste your HuggingFace token** into the 🔑 box at the top.
2. **Upload & preview** a video.
3. **Capture frames** — scrub the seek bar and click *📸 Capture current frame*,
   and/or click *✨ Auto-extract frames* for scene-based snapshots.
4. **Transcribe audio** (Whisper, local). Edit the transcript if you like.
5. **Generate step-by-step guide** (HF LLM) and review the steps.
6. **Build DOCX** — images are matched to steps and captioned, then download `guide.docx`.

## Quick backend check (no UI)

```bash
python scripts/make_sample.py   # synthesizes work/sample/sample.mp4 (4 scenes + narration)
python scripts/smoke_test.py    # runs the full pipeline and asserts a valid DOCX
```

The smoke test falls back to a naive draft if the HF API is unreachable, so DOCX
assembly is always exercised.
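
A naive draft of this kind might simply turn each transcript segment into one step. The sketch below assumes segments arrive as `(start, end, text)` tuples; the step dictionary layout and the `naive_draft` name are hypothetical and may differ from what the smoke test actually produces.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def naive_draft(segments: List[Segment]) -> List[Dict[str, object]]:
    """Build one step per non-empty transcript segment, without an LLM."""
    steps: List[Dict[str, object]] = []
    for start, end, text in segments:
        text = " ".join(text.split())  # collapse stray whitespace
        if not text:
            continue  # skip silent or empty segments
        steps.append(
            {"index": len(steps) + 1, "start": start, "end": end, "text": text}
        )
    return steps
```

Keeping the timestamps on each step means frame alignment can still run on the fallback output.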

## Deploy to HuggingFace Spaces

This repo is Spaces-ready: the YAML header in this README, `packages.txt` (installs
`ffmpeg`), and a single `requirements.txt` are all the platform needs. The app
auto-detects Spaces (via `SPACE_ID`), binds to `0.0.0.0`, and switches to
**multi-user-safe tokens** — each visitor pastes their own HF token in the UI and it
never touches the shared environment.

1. Create a new **Gradio** Space at <https://huggingface.co/new-space>.
2. Push this repo to it:
   ```bash
   git init && git add -A && git commit -m "DocuMaker"
   git remote add space https://huggingface.co/spaces/<your-username>/<space-name>
   git push space main
   ```
   (Or drag the files into the Space's *Files* tab.)
3. The Space builds and starts automatically. Visitors paste their **own** HF token —
   you don't expose yours or share its quota.

Notes for the free CPU tier: transcription and BLIP run on CPU (slower but fine for a
demo); set `DOCUMAKER_WHISPER_MODEL=base` in the Space *Settings → Variables* for
snappier transcription. You can push to **both** GitHub and the Space (add both as
git remotes).

## Configuration

All settings are environment variables (see [.env.example](.env.example)). Highlights:

| Variable | Default | Purpose |
|---|---|---|
| `DOCUMAKER_LLM_MODEL` | `Qwen/Qwen2.5-7B-Instruct` | Text LLM (any HF instruct model) |
| `DOCUMAKER_VLM_MODEL` | `Qwen/Qwen2-VL-7B-Instruct` | API vision model tried before local BLIP |
| `DOCUMAKER_LOCAL_CAPTION_MODEL` | `Salesforce/blip-image-captioning-base` | Local captioner |
| `DOCUMAKER_ENABLE_VISION` | `1` | Set `0` to skip captioning |
| `DOCUMAKER_WHISPER_MODEL` | `small` | `tiny`…`large-v3` |
| `DOCUMAKER_WHISPER_DEVICE` | `auto` | `auto` / `cuda` / `cpu` |
| `DOCUMAKER_SCENE_THRESHOLD` | `27.0` | Lower = more auto-frames |
| `DOCUMAKER_SHARE` | `0` | `1` = Gradio public share link (enables multi-user-safe tokens) |
| `DOCUMAKER_SERVER_NAME` | `127.0.0.1` | Bind address; non-localhost enables multi-user-safe tokens |
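
As a rough picture of how env-driven settings like these are typically loaded, here is a minimal sketch. The variable names and defaults come from the table above; the `Settings` class, the `load_settings` function, and the rule that only `"1"` counts as true are assumptions, not a copy of `src/config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    llm_model: str
    whisper_model: str
    whisper_device: str
    scene_threshold: float
    enable_vision: bool
    share: bool
    server_name: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read DOCUMAKER_* variables, falling back to the documented defaults."""
    return Settings(
        llm_model=env.get("DOCUMAKER_LLM_MODEL", "Qwen/Qwen2.5-7B-Instruct"),
        whisper_model=env.get("DOCUMAKER_WHISPER_MODEL", "small"),
        whisper_device=env.get("DOCUMAKER_WHISPER_DEVICE", "auto"),
        scene_threshold=float(env.get("DOCUMAKER_SCENE_THRESHOLD", "27.0")),
        # Assumption: "1" enables a flag, anything else disables it.
        enable_vision=env.get("DOCUMAKER_ENABLE_VISION", "1") == "1",
        share=env.get("DOCUMAKER_SHARE", "0") == "1",
        server_name=env.get("DOCUMAKER_SERVER_NAME", "127.0.0.1"),
    )
```

Passing a plain dictionary instead of `os.environ` makes it easy to check an override, for example `load_settings({"DOCUMAKER_WHISPER_MODEL": "base"})`.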

## Troubleshooting

- **Whisper CUDA errors / cuDNN not found:** faster-whisper (CTranslate2) needs the
  NVIDIA CUDA libraries. Either install them —
  `pip install nvidia-cublas-cu12 nvidia-cudnn-cu12` — or force CPU with
  `DOCUMAKER_WHISPER_DEVICE=cpu` (slower but always works).
- **LLM call failed / model not available:** free-tier model availability changes.
  Set `DOCUMAKER_LLM_MODEL` to another available instruct model, or pin a provider
  with `DOCUMAKER_LLM_PROVIDER`.
- **No captions in the DOCX:** the API vision model usually isn't served on free HF
  accounts ("not supported by any provider you have enabled"), so DocuMaker falls
  back to local BLIP. Check that `torch` and `transformers` installed cleanly from
  `requirements.txt` and that `DOCUMAKER_ENABLE_VISION` is not `0`. Frame
  *selection* and images still work without captions.
- **Video doesn't preview:** the app serves files from `work/` via Gradio's
  `allowed_paths` (URL prefix `/gradio_api/file=`). Make sure the upload completed;
  very large files take a moment.
- **No images in the DOCX:** capture or auto-extract frames before *Build DOCX*.
  Steps with a timestamp but no nearby frame pull a fresh frame from the video.
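
For the Whisper CUDA item above, the GPU-then-CPU fallback can be sketched as follows. This is an illustration rather than the code in `src/transcribe.py`: `load_whisper` and the injectable `loader` argument are hypothetical, and the `float16` / `int8` compute types are assumed defaults. It only covers failures raised as Python exceptions while loading; missing CUDA libraries can also surface later, during transcription.

```python
from typing import Any, Callable, List, Optional, Tuple


def _default_loader(model_size: str, device: str, compute_type: str) -> Any:
    # Imported lazily so the sketch can be read and tested without the package.
    from faster_whisper import WhisperModel

    return WhisperModel(model_size, device=device, compute_type=compute_type)


def load_whisper(
    model_size: str = "small",
    device: str = "auto",
    loader: Callable[[str, str, str], Any] = _default_loader,
) -> Tuple[Any, str]:
    """Return (model, device_used), trying CUDA before CPU when device is 'auto'."""
    if device == "cuda":
        plan: List[Tuple[str, str]] = [("cuda", "float16")]
    elif device == "cpu":
        plan = [("cpu", "int8")]
    else:
        plan = [("cuda", "float16"), ("cpu", "int8")]

    last_error: Optional[Exception] = None
    for dev, compute_type in plan:
        try:
            return loader(model_size, dev, compute_type), dev
        except Exception as exc:  # e.g. missing cuBLAS / cuDNN libraries
            last_error = exc
    raise RuntimeError(f"could not load Whisper on {device!r}") from last_error
```

Setting `DOCUMAKER_WHISPER_DEVICE=cpu` corresponds to calling this with `device="cpu"`, which skips the CUDA attempt entirely.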

## Project layout

```
app.py                 Gradio UI + event wiring
src/config.py          env-driven settings
src/video.py           ffmpeg audio extract / duration / frame@timestamp
src/frames.py          scene detection, dedup, manual-capture decode
src/transcribe.py      faster-whisper (CUDA→CPU fallback)
src/llm.py             HF Inference: transcript → structured step JSON
src/vision.py          VLM captioning (HF API) + local BLIP fallback
src/guide.py           align frames↔steps, caption
src/docx_export.py     python-docx assembly
src/web/player.html    custom HTML5 player for seek + snapshot
scripts/make_sample.py synthesize a test clip
scripts/smoke_test.py  headless end-to-end check
```