---
title: DocuMaker
emoji: 🎬
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.18.0
app_file: app.py
python_version: "3.11"
pinned: false
license: mit
short_description: Turn a tutorial video into a step-by-step DOCX guide
---
# DocuMaker
Turn a tutorial/screencast **video** into a polished **step-by-step `.docx` guide**
with screenshots — using open-source tools and **free** HuggingFace models.
> The block above is HuggingFace Spaces config. On GitHub it renders as a small
> table; on Spaces it tells the platform how to run the app.
Pipeline: `video → preview → frames (manual + automatic) → audio transcription
(Whisper) → LLM cleanup & step-structuring → image/step alignment + captions →
DOCX export`.
- **UI:** a local Gradio app.
- **Transcription:** runs **locally** with [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
(GPU with automatic CPU fallback).
- **Guide writing:** the **HuggingFace Inference API** (uses your HF token).
- **Image captions:** a vision model. It tries an API vision-chat model first, then
falls back to a **local BLIP** captioner — which is the path most free HF accounts
end up using, since few enabled providers serve a vision model on the free tier.
- **Frames:** one-click manual snapshots **and** automatic scene-detection
(PySceneDetect) de-duplicated with perceptual hashing.
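
The de-duplication idea in the last bullet can be illustrated with an average hash, one of the simplest perceptual hashes. This is a stdlib-only sketch and not DocuMaker's actual code: `average_hash`, `hamming`, `dedupe`, and the `max_distance` threshold are hypothetical names, and frames are assumed to be already downscaled to small grayscale grids.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[int]]  # small grayscale thumbnail, values 0-255


def average_hash(grid: Grid) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    pixels = [p for row in grid for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedupe(grids: List[Grid], max_distance: int = 1) -> List[int]:
    """Return indices of frames to keep, dropping near-duplicates of kept ones."""
    kept: List[int] = []
    hashes: List[int] = []
    for i, grid in enumerate(grids):
        h = average_hash(grid)
        # Keep the frame only if it differs enough from every frame kept so far.
        if all(hamming(h, other) > max_distance for other in hashes):
            kept.append(i)
            hashes.append(h)
    return kept
```

Two frames that differ only by small brightness noise produce the same hash and collapse to one entry, while a frame with a different layout survives.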
### How a frame is chosen for each step
Relevance is decided by combining several independent signals (no single model "judges" it):
1. **Timestamp alignment** — the frame on screen while that step was narrated
(Whisper timestamps ↔ step time). The strongest signal for tutorials.
2. **Sharpness** — variance-of-Laplacian, to avoid blurry scene-transition frames.
3. **BLIP caption match** — the BLIP caption is compared to the step's text; a frame
whose description overlaps the step gets a nudge. This *suggests*, it doesn't decide.
4. **Manual preference** — frames you snapshot yourself win ties.
See `_pick_frame` in [src/guide.py](src/guide.py).
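
The four signals above could be combined roughly as follows. This is an illustrative sketch, not the contents of `_pick_frame`: the `Frame` dataclass, the weights, and the helper names are assumptions, and sharpness is taken as a precomputed number.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    timestamp: float      # seconds into the video
    sharpness: float      # e.g. variance of Laplacian, higher = sharper
    caption: str = ""     # caption text, may be empty
    manual: bool = False  # True if the user captured it by hand


def caption_overlap(caption: str, step_text: str) -> float:
    """Jaccard overlap between the word sets of caption and step text."""
    a, b = set(caption.lower().split()), set(step_text.lower().split())
    return len(a & b) / len(a | b) if a and b else 0.0


def score(frame: Frame, step_time: float, step_text: str, max_sharpness: float) -> float:
    # Timestamp proximity: 1.0 at the step time, decaying with distance.
    time_score = 1.0 / (1.0 + abs(frame.timestamp - step_time))
    sharp_score = frame.sharpness / max_sharpness if max_sharpness > 0 else 0.0
    # Illustrative weights; the caption term is kept small so it only nudges.
    return (0.6 * time_score
            + 0.3 * sharp_score
            + 0.1 * caption_overlap(frame.caption, step_text))


def pick_frame(frames: List[Frame], step_time: float, step_text: str) -> Frame:
    max_sharpness = max(f.sharpness for f in frames)
    # Tuple key: the manual flag matters only when the scores are equal.
    return max(
        frames,
        key=lambda f: (round(score(f, step_time, step_text, max_sharpness), 6), f.manual),
    )
```

With these weights a sharp frame half a second away from the narration beats a blurry frame at the exact timestamp, and a manual snapshot wins only when everything else is equal.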
## Requirements
- Python 3.11
- **ffmpeg** on your `PATH` (`ffmpeg -version` should work)
- A HuggingFace token — get one at <https://huggingface.co/settings/tokens>
(a free **Read** token works). You **paste it into the app's UI**; it is not read
from the environment. (The headless smoke test reads it from `HF_TOKEN` /
`HUGGINGFACEHUB_API_TOKEN` instead.)
- A CUDA GPU is optional (Whisper falls back to CPU automatically)
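
The automatic GPU-to-CPU fallback mentioned in the last requirement can be pictured as a simple "try devices in order" loop. The sketch below is generic and hypothetical (`load_with_fallback` and the injected `load` callable are not part of DocuMaker); it only shows the pattern.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(
    load: Callable[[str], T],
    devices: Sequence[str] = ("cuda", "cpu"),
) -> Tuple[T, str]:
    """Try each device in order; return the model and the device that worked."""
    last_error: Optional[Exception] = None
    for device in devices:
        try:
            return load(device), device
        except Exception as exc:  # e.g. missing CUDA or cuDNN libraries
            last_error = exc
    raise RuntimeError(f"no usable device among {list(devices)}") from last_error
```

With faster-whisper, `load` might be something like `lambda d: WhisperModel("small", device=d)`. Some CUDA problems only surface during transcription rather than at load time, so a real implementation would likely need to guard that call as well.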
## Setup
```bash
python -m venv .venv
# macOS / Linux: source .venv/bin/activate
# Windows (PowerShell): .venv\Scripts\Activate.ps1
# Git Bash: source .venv/Scripts/activate
pip install -r requirements.txt
cp .env.example .env # optional — tweak model ids / Whisper size
```
`requirements.txt` includes the local **BLIP** captioner (`torch` + `transformers`,
CPU build). The BLIP model (~1 GB) downloads on first use and is cached. Captions
are nice-to-have: without a captioner you still get images + step text, and frame
selection uses timestamp + sharpness only.
To run BLIP on a local **NVIDIA GPU**, reinstall a CUDA build of torch — it's used
automatically when available:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu124
```
### HuggingFace token
The **app takes your token from the UI** — paste it into the "🔑 HuggingFace token"
box at the top of the page. When you enter it (local single-user mode), the app
sets it as the process `HF_TOKEN` (in memory only, never written to disk) so every
HuggingFace operation — the guide LLM and model downloads — authenticates with it,
overriding any stale token already in your environment.
**Shared / multi-user deployments:** if you launch with `DOCUMAKER_SHARE=1` or a
non-localhost `DOCUMAKER_SERVER_NAME`, the app switches to per-session tokens and
**does not** touch the global `HF_TOKEN` — so one user's token can never leak to
another. The token is still threaded directly to that user's LLM/caption calls (the
guide LLM clients are created per request, never cached across sessions).
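
The switch between the two token modes can be sketched as below. This is a guess at the logic from the description in this README, not the app's code: `is_multi_user`, `resolve_token`, and `LOCAL_HOSTS` are hypothetical names, and the `SPACE_ID` check follows the Spaces section further down.

```python
from typing import Mapping, MutableMapping

LOCAL_HOSTS = {"127.0.0.1", "localhost", "::1"}


def is_multi_user(env: Mapping[str, str]) -> bool:
    """True when tokens must stay per-session rather than process-wide."""
    if env.get("SPACE_ID"):                      # running on HuggingFace Spaces
        return True
    if env.get("DOCUMAKER_SHARE", "0") == "1":   # public Gradio share link
        return True
    return env.get("DOCUMAKER_SERVER_NAME", "127.0.0.1") not in LOCAL_HOSTS


def resolve_token(
    ui_token: str,
    env: Mapping[str, str],
    process_env: MutableMapping[str, str],
) -> str:
    """Return the token to pass to this request's clients.

    In local single-user mode the token is also exported as HF_TOKEN
    (in memory only); in shared mode the global value is left untouched.
    """
    if not is_multi_user(env):
        process_env["HF_TOKEN"] = ui_token
    return ui_token
```

In a real app `env` and `process_env` would both be `os.environ`; they are separate parameters here only so the behavior is easy to test.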
The **headless smoke test** (`scripts/smoke_test.py`) instead reads the token from
`HF_TOKEN` / `HUGGINGFACEHUB_API_TOKEN` and validates which one actually
authenticates (handy if one is stale).
## Run
```bash
python app.py
```
Open the printed local URL, then:
1. **Paste your HuggingFace token** into the 🔑 box at the top.
2. **Upload & preview** a video.
3. **Capture frames** — scrub the seek bar and click *📸 Capture current frame*,
and/or click *✨ Auto-extract frames* for scene-based snapshots.
4. **Transcribe audio** (Whisper, local). Edit the transcript if you like.
5. **Generate step-by-step guide** (HF LLM) and review the steps.
6. **Build DOCX** — images are matched to steps and captioned, then download `guide.docx`.
## Quick backend check (no UI)
```bash
python scripts/make_sample.py # synthesizes work/sample/sample.mp4 (4 scenes + narration)
python scripts/smoke_test.py # runs the full pipeline and asserts a valid DOCX
```
The smoke test falls back to a naive draft if the HF API is unreachable, so DOCX
assembly is always exercised.
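
What "a valid DOCX" means for the smoke test is not spelled out here. One cheap structural check, shown as a hedged sketch, relies on the fact that a `.docx` file is a ZIP archive containing `[Content_Types].xml` and `word/document.xml`. The function name `looks_like_docx` is hypothetical.

```python
import io
import zipfile

REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def looks_like_docx(data: bytes) -> bool:
    """Return True if the bytes form a ZIP archive with the core DOCX parts."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return False
    return all(part in names for part in REQUIRED_PARTS)
```

This only checks the container, not that Word can render the document, so it complements rather than replaces opening the file.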
## Deploy to HuggingFace Spaces
This repo is Spaces-ready: the YAML header in this README, `packages.txt` (installs
`ffmpeg`), and a single `requirements.txt` are all the platform needs. The app
auto-detects Spaces (via `SPACE_ID`), binds to `0.0.0.0`, and switches to
**multi-user-safe tokens** — each visitor pastes their own HF token in the UI and it
never touches the shared environment.
1. Create a new **Gradio** Space at <https://huggingface.co/new-space>.
2. Push this repo to it:
```bash
git init && git add -A && git commit -m "DocuMaker"
git remote add space https://huggingface.co/spaces/<your-username>/<space-name>
git push space main
```
(Or drag the files into the Space's *Files* tab.)
3. The Space builds and starts automatically. Visitors paste their **own** HF token —
you don't expose yours or share its quota.
Notes for the free CPU tier: transcription and BLIP run on CPU (slower but fine for a
demo); set `DOCUMAKER_WHISPER_MODEL=base` in the Space *Settings → Variables* for
snappier transcription. You can push to **both** GitHub and the Space (add both as
git remotes).
## Configuration
All settings are environment variables (see [.env.example](.env.example)). Highlights:
| Variable | Default | Purpose |
|---|---|---|
| `DOCUMAKER_LLM_MODEL` | `Qwen/Qwen2.5-7B-Instruct` | Text LLM (any HF instruct model) |
| `DOCUMAKER_VLM_MODEL` | `Qwen/Qwen2-VL-7B-Instruct` | API vision model tried before local BLIP |
| `DOCUMAKER_LOCAL_CAPTION_MODEL` | `Salesforce/blip-image-captioning-base` | Local captioner |
| `DOCUMAKER_ENABLE_VISION` | `1` | Set `0` to skip captioning |
| `DOCUMAKER_WHISPER_MODEL` | `small` | Model size, from `tiny` to `large-v3` |
| `DOCUMAKER_WHISPER_DEVICE` | `auto` | `auto` / `cuda` / `cpu` |
| `DOCUMAKER_SCENE_THRESHOLD` | `27.0` | Lower = more auto-frames |
| `DOCUMAKER_SHARE` | `0` | `1` = Gradio public share link (enables multi-user-safe tokens) |
| `DOCUMAKER_SERVER_NAME` | `127.0.0.1` | Bind address; non-localhost enables multi-user-safe tokens |
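
Env-driven settings like these are commonly read once into a typed object. The sketch below shows that pattern with the variable names and defaults from the table; the `Settings` class and `load_settings` function are hypothetical and may not match `src/config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    llm_model: str
    whisper_model: str
    whisper_device: str
    scene_threshold: float
    enable_vision: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, falling back to the documented defaults."""
    return Settings(
        llm_model=env.get("DOCUMAKER_LLM_MODEL", "Qwen/Qwen2.5-7B-Instruct"),
        whisper_model=env.get("DOCUMAKER_WHISPER_MODEL", "small"),
        whisper_device=env.get("DOCUMAKER_WHISPER_DEVICE", "auto"),
        # Env values are strings, so numeric settings need an explicit cast.
        scene_threshold=float(env.get("DOCUMAKER_SCENE_THRESHOLD", "27.0")),
        # Anything other than "0" leaves captioning enabled.
        enable_vision=env.get("DOCUMAKER_ENABLE_VISION", "1") != "0",
    )
```

Passing a plain dict instead of `os.environ` makes it easy to try overrides, for example `load_settings({"DOCUMAKER_WHISPER_MODEL": "base"})`.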
## Troubleshooting
- **Whisper CUDA errors / cuDNN not found:** faster-whisper (CTranslate2) needs the
NVIDIA CUDA libraries. Either install them —
`pip install nvidia-cublas-cu12 nvidia-cudnn-cu12` — or force CPU with
`DOCUMAKER_WHISPER_DEVICE=cpu` (slower but always works).
- **LLM call failed / model not available:** free-tier model availability changes.
Set `DOCUMAKER_LLM_MODEL` to another available instruct model, or pin a provider
with `DOCUMAKER_LLM_PROVIDER`.
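- **No captions in the DOCX:** the API vision model usually isn't served on free HF
accounts ("not supported by any provider you have enabled"), so DocuMaker falls back to
local BLIP (installed via `requirements.txt`). If captions are still missing, check that
`DOCUMAKER_ENABLE_VISION` is not `0` and that the BLIP model finished downloading.
Frame *selection* and images still work without captions.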
- **Video doesn't preview:** the app serves files from `work/` via Gradio's
`allowed_paths` (URL prefix `/gradio_api/file=`). Make sure the upload completed;
very large files take a moment.
- **No images in the DOCX:** capture or auto-extract frames before *Build DOCX*.
Steps with a timestamp but no nearby frame pull a fresh frame from the video.
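
Pulling a fresh frame at a timestamp, as the last item describes, is typically a single ffmpeg call. The following is a minimal sketch assuming `ffmpeg` is on the `PATH`; `frame_command` and `grab_frame` are hypothetical helpers and may differ from what `src/video.py` does.

```python
import subprocess
from typing import List


def frame_command(video: str, seconds: float, out_png: str) -> List[str]:
    """Build an ffmpeg command that writes one frame at the given time."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-ss", f"{seconds:.3f}",   # seek before decoding the input
        "-i", video,
        "-frames:v", "1",          # stop after a single video frame
        out_png,
    ]


def grab_frame(video: str, seconds: float, out_png: str) -> None:
    """Run ffmpeg; raises CalledProcessError if extraction fails."""
    subprocess.run(frame_command(video, seconds, out_png), check=True, capture_output=True)
```

Placing `-ss` before `-i` makes ffmpeg seek in the input rather than decode everything up to the timestamp, which keeps extraction fast on long videos.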
## Project layout
```
app.py Gradio UI + event wiring
src/config.py env-driven settings
src/video.py ffmpeg audio extract / duration / frame@timestamp
src/frames.py scene detection, dedup, manual-capture decode
src/transcribe.py faster-whisper (CUDA→CPU fallback)
src/llm.py HF Inference: transcript → structured step JSON
src/vision.py VLM captioning (HF API) + local BLIP fallback
src/guide.py align frames↔steps, caption
src/docx_export.py python-docx assembly
src/web/player.html custom HTML5 player for seek + snapshot
scripts/make_sample.py synthesize a test clip
scripts/smoke_test.py headless end-to-end check
```