title: DocuMaker
emoji: 🎬
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.18.0
app_file: app.py
python_version: '3.11'
pinned: false
license: mit
short_description: Turn a tutorial video into a step-by-step DOCX guide
DocuMaker
Turn a tutorial/screencast video into a polished step-by-step .docx guide
with screenshots, using open-source tools and free HuggingFace models.
The block above is HuggingFace Spaces config. On GitHub it renders as a small table; on Spaces it tells the platform how to run the app.
Pipeline: video → preview → frames (manual + automatic) → audio transcription (Whisper) → LLM cleanup & step-structuring → image/step alignment + captions → DOCX export.
- UI: a local Gradio app.
- Transcription: runs locally with faster-whisper (GPU with automatic CPU fallback).
- Guide writing: the HuggingFace Inference API (uses your HF token).
- Image captions: a vision model. It tries an API vision-chat model first, then falls back to a local BLIP captioner, which is the path most free HF accounts end up using, since few enabled providers serve a vision model on the free tier.
- Frames: one-click manual snapshots and automatic scene-detection (PySceneDetect) de-duplicated with perceptual hashing.
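The de-duplication step can be illustrated with a minimal sketch. It assumes an average hash (one simple kind of perceptual hash) computed on small grayscale thumbnails and a hypothetical `max_distance` threshold; the hash and threshold DocuMaker actually uses may differ.

```python
from typing import List, Sequence

# Small grayscale thumbnail, e.g. 8x8 values in 0..255 (assumed input format).
Gray = Sequence[Sequence[int]]


def average_hash(thumb: Gray) -> int:
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    pixels = [p for row in thumb for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedupe(thumbs: List[Gray], max_distance: int = 5) -> List[int]:
    """Return indices of frames to keep, dropping near-duplicates of kept frames."""
    kept: List[int] = []
    hashes: List[int] = []
    for i, thumb in enumerate(thumbs):
        h = average_hash(thumb)
        # Keep the frame only if it differs enough from every frame kept so far.
        if all(hamming(h, other) > max_distance for other in hashes):
            kept.append(i)
            hashes.append(h)
    return kept
```

Comparing hashes by Hamming distance, rather than comparing pixels directly, keeps the check cheap and tolerant of small changes such as a moving cursor.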
How a frame is chosen for each step
Relevance is decided by combining several signals (no single model "judges" it):
- Timestamp alignment: the frame on screen while that step was narrated (Whisper timestamps → step time). The strongest signal for tutorials.
- Sharpness: variance of the Laplacian, to avoid blurry scene-transition frames.
- BLIP caption match: the BLIP caption is compared to the step's text; a frame whose description overlaps the step gets a nudge. This signal suggests; it doesn't decide.
- Manual preference: frames you snapshot yourself win ties.
See _pick_frame in src/guide.py.
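As a rough illustration of how such signals could be combined, here is a minimal sketch. It is not the actual _pick_frame: the `Frame` class, the weights, the time-decay formula, and the word-overlap measure are all assumptions chosen for clarity. Sharpness is taken as a precomputed number (with OpenCV this is commonly `cv2.Laplacian(gray, cv2.CV_64F).var()`).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    time_s: float         # position of the frame in the video
    sharpness: float      # e.g. variance of the Laplacian, higher = sharper
    caption: str = ""     # BLIP caption, may be empty
    manual: bool = False  # True if the user captured it by hand


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; 0.0 when either side is empty."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def score(frame: Frame, step_time_s: float, step_text: str, max_sharpness: float) -> float:
    # Timestamp alignment dominates: 1.0 at the step time, decaying with distance.
    time_score = 1.0 / (1.0 + abs(frame.time_s - step_time_s))
    sharp_score = frame.sharpness / max_sharpness if max_sharpness > 0 else 0.0
    caption_score = word_overlap(frame.caption, step_text)
    # Hypothetical weights; the manual bonus is tiny so it only breaks ties.
    bonus = 1e-6 if frame.manual else 0.0
    return 0.6 * time_score + 0.25 * sharp_score + 0.15 * caption_score + bonus


def pick_frame(frames: List[Frame], step_time_s: float, step_text: str) -> Optional[Frame]:
    """Return the best-scoring frame for a step, or None if there are no frames."""
    if not frames:
        return None
    max_sharp = max(f.sharpness for f in frames)
    return max(frames, key=lambda f: score(f, step_time_s, step_text, max_sharp))
```

Because the caption term carries the smallest weight, a good caption match can tip a close call but cannot override a frame that is far better aligned in time.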
Requirements
- Python 3.11
- ffmpeg on your PATH (ffmpeg -version should work)
- A HuggingFace token: get one at https://huggingface.co/settings/tokens
(a free Read token works). You paste it into the app's UI; it is not read
from the environment. (The headless smoke test reads it from
HF_TOKEN / HUGGINGFACEHUB_API_TOKEN instead.)
- A CUDA GPU is optional (Whisper falls back to CPU automatically)
Setup
python -m venv .venv
# Windows (PowerShell): .venv\Scripts\Activate.ps1
# Git Bash: source .venv/Scripts/activate
pip install -r requirements.txt
cp .env.example .env # optional: tweak model ids / Whisper size
requirements.txt includes the local BLIP captioner (torch + transformers,
CPU build). The BLIP model (~1 GB) downloads on first use and is cached. Captions
are nice-to-have: without a captioner you still get images + step text, and frame
selection uses timestamp + sharpness only.
To run BLIP on a local NVIDIA GPU, reinstall a CUDA build of torch; it is used automatically when available:
pip install torch --index-url https://download.pytorch.org/whl/cu124
HuggingFace token
The app takes your token from the UI: paste it into the "HuggingFace token"
box at the top of the page. When you enter it (local single-user mode), the app
sets it as the process HF_TOKEN (in memory only, never written to disk) so every
HuggingFace operation (the guide LLM and model downloads) authenticates with it,
overriding any stale token already in your environment.
Shared / multi-user deployments: if you launch with DOCUMAKER_SHARE=1 or a
non-localhost DOCUMAKER_SERVER_NAME, the app switches to per-session tokens and
does not touch the global HF_TOKEN, so one user's token can never leak to
another. The token is still threaded directly to that user's LLM/caption calls (the
guide LLM clients are created per request, never cached across sessions).
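The difference between the two modes can be sketched as follows. The function name `apply_token` and its `multi_user` flag are hypothetical; this only illustrates the behavior described above and is not DocuMaker's actual code.

```python
import os
from typing import MutableMapping, Optional


def apply_token(
    token: str,
    multi_user: bool,
    env: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Return the token to pass to this request's HuggingFace calls.

    Local single-user mode also publishes it as the process-wide HF_TOKEN
    (in memory only). Multi-user mode leaves the environment untouched.
    """
    env = os.environ if env is None else env
    token = token.strip()
    if not token:
        raise ValueError("A HuggingFace token is required")
    if not multi_user:
        # Overrides any stale value for the lifetime of this process.
        env["HF_TOKEN"] = token
    return token
```

In either mode the returned token would be handed straight to a client built for that request, for example `InferenceClient(token=token)` from huggingface_hub, so no client is shared between sessions.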
The headless smoke test (scripts/smoke_test.py) instead reads the token from
HF_TOKEN / HUGGINGFACEHUB_API_TOKEN and validates which one actually
authenticates (handy if one is stale).
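A minimal sketch of that check, assuming the variables are tried in the order listed. The helper name `first_working_token` and the injected `is_valid` callback are hypothetical; scripts/smoke_test.py may be structured differently.

```python
import os
from typing import Callable, Mapping, Optional, Tuple

TOKEN_VARS = ("HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN")


def first_working_token(
    is_valid: Callable[[str], bool],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Tuple[str, str]]:
    """Return (variable name, token) for the first token that authenticates."""
    env = os.environ if env is None else env
    for name in TOKEN_VARS:
        token = env.get(name, "").strip()
        # Skip unset or empty variables, and tokens the validator rejects.
        if token and is_valid(token):
            return name, token
    return None
```

In practice `is_valid` could wrap `huggingface_hub.whoami(token=...)` in a try/except and return False on failure; injecting it keeps the selection logic testable offline.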
Run
python app.py
Open the printed local URL, then:
- Paste your HuggingFace token into the token box at the top.
- Upload & preview a video.
- Capture frames: scrub the seek bar and click 📸 Capture current frame, and/or click ✨ Auto-extract frames for scene-based snapshots.
- Transcribe audio (Whisper, local). Edit the transcript if you like.
- Generate step-by-step guide (HF LLM) and review the steps.
- Build DOCX: images are matched to steps and captioned; then download
guide.docx.
Quick backend check (no UI)
python scripts/make_sample.py # synthesizes work/sample/sample.mp4 (4 scenes + narration)
python scripts/smoke_test.py # runs the full pipeline and asserts a valid DOCX
The smoke test falls back to a naive draft if the HF API is unreachable, so DOCX assembly is always exercised.
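The fallback idea can be sketched like this. The one-sentence-per-step split and the `title`/`text` keys are assumptions for illustration; the smoke test's real naive draft and step schema may look different.

```python
import re
from typing import Callable, Dict, List

Step = Dict[str, str]


def naive_draft(transcript: str) -> List[Step]:
    """Split a transcript into one step per sentence, with no LLM involved."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()]
    return [{"title": f"Step {i}", "text": s} for i, s in enumerate(sentences, start=1)]


def draft_steps(transcript: str, llm: Callable[[str], List[Step]]) -> List[Step]:
    """Prefer the LLM; fall back to the naive draft if the call fails."""
    try:
        return llm(transcript)
    except Exception:
        # Network errors, missing models, and bad responses all land here.
        return naive_draft(transcript)
```

With a fallback of this shape, the later stages (frame alignment and DOCX assembly) always receive some list of steps, even when the API is unreachable.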
Deploy to HuggingFace Spaces
This repo is Spaces-ready: the YAML header in this README, packages.txt (installs
ffmpeg), and a single requirements.txt are all the platform needs. The app
auto-detects Spaces (via SPACE_ID), binds to 0.0.0.0, and switches to
multi-user-safe tokens: each visitor pastes their own HF token in the UI and it
never touches the shared environment.
- Create a new Gradio Space at https://huggingface.co/new-space.
- Push this repo to it:
  git init && git add -A && git commit -m "DocuMaker"
  git remote add space https://huggingface.co/spaces/<your-username>/<space-name>
  git push space main
  (Or drag the files into the Space's Files tab.)
- The Space builds and starts automatically. Visitors paste their own HF token, so you don't expose yours or share its quota.
Notes for the free CPU tier: transcription and BLIP run on CPU (slower but fine for a
demo); set DOCUMAKER_WHISPER_MODEL=base in the Space Settings → Variables for
snappier transcription. You can push to both GitHub and the Space (add both as
git remotes).
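The mode detection described in this section could look roughly like the sketch below. The function name `launch_settings` and the returned keys are hypothetical, and the real app.py may decide differently; only the environment variable names come from this README.

```python
import os
from typing import Dict, Mapping, Optional, Union

LOCAL_HOSTS = {"127.0.0.1", "localhost"}


def launch_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Union[str, bool]]:
    """Derive the bind address, share flag, and token mode from the environment."""
    env = os.environ if env is None else env
    on_spaces = bool(env.get("SPACE_ID"))  # set by the Spaces platform
    share = env.get("DOCUMAKER_SHARE", "0") == "1"
    server_name = "0.0.0.0" if on_spaces else env.get("DOCUMAKER_SERVER_NAME", "127.0.0.1")
    # Any publicly reachable setup gets per-session (multi-user-safe) tokens.
    multi_user = on_spaces or share or server_name not in LOCAL_HOSTS
    return {"server_name": server_name, "share": share, "multi_user": multi_user}
```

The `server_name` and `share` values map directly onto the arguments of the same names accepted by Gradio's `launch()`.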
Configuration
All settings are environment variables (see .env.example). Highlights:
| Variable | Default | Purpose |
|---|---|---|
| DOCUMAKER_LLM_MODEL | Qwen/Qwen2.5-7B-Instruct | Text LLM (any HF instruct model) |
| DOCUMAKER_VLM_MODEL | Qwen/Qwen2-VL-7B-Instruct | API vision model tried before local BLIP |
| DOCUMAKER_LOCAL_CAPTION_MODEL | Salesforce/blip-image-captioning-base | Local captioner |
| DOCUMAKER_ENABLE_VISION | 1 | Set 0 to skip captioning |
| DOCUMAKER_WHISPER_MODEL | small | tiny…large-v3 |
| DOCUMAKER_WHISPER_DEVICE | auto | auto / cuda / cpu |
| DOCUMAKER_SCENE_THRESHOLD | 27.0 | Lower = more auto-frames |
| DOCUMAKER_SHARE | 0 | 1 = Gradio public share link (enables multi-user-safe tokens) |
| DOCUMAKER_SERVER_NAME | 127.0.0.1 | Bind address; non-localhost enables multi-user-safe tokens |
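For orientation, env-driven settings of this kind are often loaded into a small dataclass. The sketch below uses the variable names and defaults from the table; the `Settings` class, its field names, and `load_settings` are hypothetical and need not match src/config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    llm_model: str
    vlm_model: str
    local_caption_model: str
    enable_vision: bool
    whisper_model: str
    whisper_device: str
    scene_threshold: float
    share: bool
    server_name: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read DOCUMAKER_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env

    def get(name: str, default: str) -> str:
        return env.get(f"DOCUMAKER_{name}", default)

    return Settings(
        llm_model=get("LLM_MODEL", "Qwen/Qwen2.5-7B-Instruct"),
        vlm_model=get("VLM_MODEL", "Qwen/Qwen2-VL-7B-Instruct"),
        local_caption_model=get("LOCAL_CAPTION_MODEL", "Salesforce/blip-image-captioning-base"),
        enable_vision=get("ENABLE_VISION", "1") != "0",
        whisper_model=get("WHISPER_MODEL", "small"),
        whisper_device=get("WHISPER_DEVICE", "auto"),
        scene_threshold=float(get("SCENE_THRESHOLD", "27.0")),
        share=get("SHARE", "0") == "1",
        server_name=get("SERVER_NAME", "127.0.0.1"),
    )
```

Passing a plain dict as `env` makes it easy to try overrides without touching the real environment, for example `load_settings({"DOCUMAKER_WHISPER_MODEL": "base"})`.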
Troubleshooting
- Whisper CUDA errors / cuDNN not found: faster-whisper (CTranslate2) needs the
NVIDIA CUDA libraries. Either install them (pip install nvidia-cublas-cu12 nvidia-cudnn-cu12)
or force CPU with DOCUMAKER_WHISPER_DEVICE=cpu (slower but always works).
- LLM call failed / model not available: free-tier model availability changes.
Set DOCUMAKER_LLM_MODEL to another available instruct model, or pin a provider
with DOCUMAKER_LLM_PROVIDER.
- No captions in the DOCX: the API vision model usually isn't served on free HF
accounts ("not supported by any provider you have enabled"), so DocuMaker uses
local BLIP (installed via requirements.txt). Frame selection and images still
work without it.
- Video doesn't preview: the app serves files from work/ via Gradio's
allowed_paths (URL prefix /gradio_api/file=). Make sure the upload completed;
very large files take a moment.
- No images in the DOCX: capture or auto-extract frames before Build DOCX. Steps with a timestamp but no nearby frame pull a fresh frame from the video.
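The GPU-to-CPU fallback mentioned in the first item can be sketched generically. The helper `load_with_fallback` and its injected `loader` are hypothetical, and this is not the code in src/transcribe.py; it only shows the try-CUDA-then-CPU pattern.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

M = TypeVar("M")


def load_with_fallback(
    loader: Callable[[str], M],
    preference: str = "auto",
) -> Tuple[M, str]:
    """Try devices in order and return (model, device) for the first that loads."""
    devices: Sequence[str] = ("cuda", "cpu") if preference == "auto" else (preference,)
    last_error: Optional[Exception] = None
    for device in devices:
        try:
            return loader(device), device
        except Exception as exc:  # e.g. missing cuBLAS/cuDNN libraries
            last_error = exc
    raise RuntimeError(f"No usable device among {list(devices)}") from last_error
```

With faster-whisper, the loader might be `lambda d: WhisperModel("small", device=d)`; setting the preference to cpu skips the CUDA attempt entirely, which mirrors DOCUMAKER_WHISPER_DEVICE=cpu.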
Project layout
app.py                  Gradio UI + event wiring
src/config.py           env-driven settings
src/video.py            ffmpeg audio extract / duration / frame@timestamp
src/frames.py           scene detection, dedup, manual-capture decode
src/transcribe.py       faster-whisper (CUDA → CPU fallback)
src/llm.py              HF Inference: transcript → structured step JSON
src/vision.py           VLM captioning (HF API) + local BLIP fallback
src/guide.py            align frames to steps, caption
src/docx_export.py      python-docx assembly
src/web/player.html     custom HTML5 player for seek + snapshot
scripts/make_sample.py  synthesize a test clip
scripts/smoke_test.py   headless end-to-end check