title: DocuMaker
emoji: 🎬
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.18.0
app_file: app.py
python_version: '3.11'
pinned: false
license: mit
short_description: Turn a tutorial video into a step-by-step DOCX guide

DocuMaker

Turn a tutorial/screencast video into a polished step-by-step .docx guide with screenshots, using open-source tools and free HuggingFace models.

The block above is HuggingFace Spaces config. On GitHub it renders as a small table; on Spaces it tells the platform how to run the app.

Pipeline: video → preview → frames (manual + automatic) → audio transcription (Whisper) → LLM cleanup & step-structuring → image/step alignment + captions → DOCX export.

  • UI: a local Gradio app.
  • Transcription: runs locally with faster-whisper (GPU with automatic CPU fallback).
  • Guide writing: the HuggingFace Inference API (uses your HF token).
  • Image captions: a vision model. It tries an API vision-chat model first, then falls back to a local BLIP captioner, which is the path most free HF accounts end up using, since few enabled providers serve a vision model on the free tier.
  • Frames: one-click manual snapshots and automatic scene-detection (PySceneDetect) de-duplicated with perceptual hashing.
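
As a rough illustration of the de-duplication idea in the last bullet, the sketch below computes a simple average hash (one kind of perceptual hash) over a small grayscale grid and drops frames whose hash is within a few bits of one already kept. It is not the code in src/frames.py: the function names, the max_distance threshold, and the use of plain nested lists in place of decoded images are all assumptions made for the example.

```python
def average_hash(gray):
    """Average hash of a small grayscale grid (list of rows of 0-255 ints).

    Each pixel becomes one bit: 1 if it is brighter than the grid mean.
    Real code would first downscale the frame (for example to 8x8).
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedupe(frames, max_distance=2):
    """Keep a frame only if it differs from every kept frame by more than
    max_distance bits. `frames` is a list of (name, gray_grid) pairs;
    max_distance is a hypothetical threshold, not a DocuMaker setting.
    """
    kept = []  # (name, hash) pairs, in input order
    for name, gray in frames:
        h = average_hash(gray)
        if all(hamming(h, kept_hash) > max_distance for _, kept_hash in kept):
            kept.append((name, h))
    return [name for name, _ in kept]
```

Two frames that differ only by slight brightness noise hash to the same bits and collapse into one, while a genuinely different layout survives.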

How a frame is chosen for each step

Relevance is decided by combining several signals (no single model "judges" it):

  1. Timestamp alignment: the frame on screen while that step was narrated (Whisper timestamps ↔ step time). The strongest signal for tutorials.
  2. Sharpness: variance-of-Laplacian, to avoid blurry scene-transition frames.
  3. BLIP caption match: the BLIP caption is compared to the step's text; a frame whose description overlaps the step gets a nudge. This suggests, it doesn't decide.
  4. Manual preference: frames you snapshot yourself win ties.

See _pick_frame in src/guide.py.
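
One way the four signals could be combined is sketched below. This is not the actual _pick_frame: the Frame class, the weights (0.6 / 0.25 / 0.15), the distance decay, and the word-overlap measure are all assumptions chosen to mirror the ordering described above, with the manual flag used only to break ties. Sharpness is taken as a precomputed number (for example a variance-of-Laplacian value).

```python
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp: float      # seconds into the video
    sharpness: float      # precomputed, e.g. variance of the Laplacian
    caption: str = ""     # BLIP caption; empty when captioning is off
    manual: bool = False  # True for frames the user captured


def caption_overlap(caption, step_text):
    """Jaccard overlap between caption words and step words, in [0, 1]."""
    a = set(caption.lower().split())
    b = set(step_text.lower().split())
    return len(a & b) / len(a | b) if a and b else 0.0


def score_frame(frame, start, end, step_text, max_sharpness):
    # Timestamp alignment: full marks inside the step's narration window,
    # decaying with distance in seconds outside it.
    if start <= frame.timestamp <= end:
        time_score = 1.0
    else:
        gap = min(abs(frame.timestamp - start), abs(frame.timestamp - end))
        time_score = 1.0 / (1.0 + gap)
    sharp_score = frame.sharpness / max_sharpness if max_sharpness > 0 else 0.0
    text_score = caption_overlap(frame.caption, step_text)
    # Hypothetical weights: timestamp dominates, the caption only nudges.
    return 0.6 * time_score + 0.25 * sharp_score + 0.15 * text_score


def pick_frame(frames, start, end, step_text):
    """Return the best frame for a step, or None if there are no frames."""
    if not frames:
        return None
    max_sharpness = max(f.sharpness for f in frames)

    def key(f):
        score = score_frame(f, start, end, step_text, max_sharpness)
        return (round(score, 6), f.manual)  # manual frames win exact ties

    return max(frames, key=key)
```

With captioning disabled every caption is empty, the overlap term is zero, and the choice rests on timestamp and sharpness alone.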

Requirements

  • Python 3.11
  • ffmpeg on your PATH (ffmpeg -version should work)
  • A HuggingFace token: get one at https://huggingface.co/settings/tokens (a free Read token works). You paste it into the app's UI; it is not read from the environment. (The headless smoke test reads it from HF_TOKEN / HUGGINGFACEHUB_API_TOKEN instead.)
  • A CUDA GPU is optional (Whisper falls back to CPU automatically)
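
The first two requirements can be checked before launching. The helper below is a small sketch, not part of the repo: check_requirements is a hypothetical name, and the which/version parameters exist only so the check can be exercised without touching the real environment.

```python
import shutil
import sys


def check_requirements(which=shutil.which, version=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version = tuple(version or sys.version_info[:2])
    problems = []
    if version != (3, 11):
        problems.append(f"Python 3.11 expected, found {version[0]}.{version[1]}")
    if which("ffmpeg") is None:  # same lookup the shell does for `ffmpeg -version`
        problems.append("ffmpeg not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("warning:", problem)
```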

Setup

python -m venv .venv
# Windows (PowerShell):  .venv\Scripts\Activate.ps1
# Git Bash:              source .venv/Scripts/activate
# macOS / Linux:         source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env   # optional: tweak model ids / Whisper size

requirements.txt includes the local BLIP captioner (torch + transformers, CPU build). The BLIP model (~1 GB) downloads on first use and is cached. Captions are nice-to-have: without a captioner you still get images + step text, and frame selection uses timestamp + sharpness only.

To run BLIP on a local NVIDIA GPU, reinstall a CUDA build of torch; it's used automatically when available:

pip install torch --index-url https://download.pytorch.org/whl/cu124
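
To confirm which device a given torch build would use, a check along these lines may help. torch.cuda.is_available() is a real PyTorch call; the helper name pick_device is made up for this example, and how DocuMaker itself selects the device is not shown here.

```python
def pick_device():
    """Return "cuda" when a CUDA-enabled torch can see a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch missing: nothing to run on a GPU anyway
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("BLIP would run on:", pick_device())
```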

HuggingFace token

The app takes your token from the UI: paste it into the "🔑 HuggingFace token" box at the top of the page. When you enter it (local single-user mode), the app sets it as the process HF_TOKEN (in memory only, never written to disk) so every HuggingFace operation (the guide LLM and model downloads) authenticates with it, overriding any stale token already in your environment.

Shared / multi-user deployments: if you launch with DOCUMAKER_SHARE=1 or a non-localhost DOCUMAKER_SERVER_NAME, the app switches to per-session tokens and does not touch the global HF_TOKEN, so one user's token can never leak to another. The token is still threaded directly to that user's LLM/caption calls (the guide LLM clients are created per request, never cached across sessions).
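
The single-user versus shared decision described above can be sketched roughly as follows. The function names and the set of hosts treated as local are assumptions, not the app's actual code; the SPACE_ID check reflects the Spaces auto-detection mentioned in the deployment section.

```python
import os

# Hosts treated as "local only" in this sketch (an assumption).
LOCAL_HOSTS = {"127.0.0.1", "localhost", "::1"}


def is_multi_user(env):
    """True when the app may be reachable by more than one person."""
    if env.get("SPACE_ID"):                      # running on HuggingFace Spaces
        return True
    if env.get("DOCUMAKER_SHARE", "0") == "1":   # Gradio public share link
        return True
    return env.get("DOCUMAKER_SERVER_NAME", "127.0.0.1") not in LOCAL_HOSTS


def resolve_token(ui_token, env):
    """Return the token to pass to this request's LLM/caption calls.

    In single-user mode the token is also written to env["HF_TOKEN"]
    (process memory only). In multi-user mode the environment is left alone.
    """
    token = ui_token.strip()
    if token and not is_multi_user(env):
        env["HF_TOKEN"] = token
    return token


if __name__ == "__main__":
    print("multi-user mode:", is_multi_user(os.environ))
```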

The headless smoke test (scripts/smoke_test.py) instead reads the token from HF_TOKEN / HUGGINGFACEHUB_API_TOKEN and validates which one actually authenticates (handy if one is stale).
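
A check of this kind might look like the sketch below. It is not the code in scripts/smoke_test.py: first_working_token and hub_accepts are hypothetical names. huggingface_hub.whoami is a real function that raises when the token is rejected; calling it needs network access.

```python
import os

TOKEN_VARS = ("HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN")


def hub_accepts(token):
    """True if the Hub accepts the token (needs network and huggingface_hub)."""
    from huggingface_hub import whoami  # lazy import: the rest works offline
    try:
        whoami(token=token)
        return True
    except Exception:
        return False


def first_working_token(env=os.environ, validate=hub_accepts):
    """Return (variable_name, token) for the first token that validates, else None."""
    for name in TOKEN_VARS:
        token = env.get(name, "").strip()
        if token and validate(token):
            return name, token
    return None
```

Passing a custom validate callable makes the lookup order testable without contacting the Hub.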

Run

python app.py

Open the printed local URL, then:

  1. Paste your HuggingFace token into the 🔑 box at the top.
  2. Upload & preview a video.
  3. Capture frames: scrub the seek bar and click 📸 Capture current frame, and/or click ✨ Auto-extract frames for scene-based snapshots.
  4. Transcribe audio (Whisper, local). Edit the transcript if you like.
  5. Generate step-by-step guide (HF LLM) and review the steps.
  6. Build DOCX: images are matched to steps and captioned; then download guide.docx.

Quick backend check (no UI)

python scripts/make_sample.py   # synthesizes work/sample/sample.mp4 (4 scenes + narration)
python scripts/smoke_test.py    # runs the full pipeline and asserts a valid DOCX

The smoke test falls back to a naive draft if the HF API is unreachable, so DOCX assembly is always exercised.
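
The fallback pattern can be pictured as below. This is only a sketch: the segment and step dictionaries, naive_draft, and draft_guide are illustrative shapes and names, not the project's real data structures.

```python
def naive_draft(segments):
    """One step per non-empty transcript segment; no LLM involved."""
    steps = []
    for seg in segments:
        text = seg["text"].strip()
        if text:
            steps.append({
                "title": f"Step {len(steps) + 1}",
                "text": text,
                "time": seg["start"],  # kept so frames can still be aligned
            })
    return steps


def draft_guide(segments, llm_call):
    """Try the LLM first; on any failure fall back to the naive draft."""
    try:
        return llm_call(segments)
    except Exception:
        return naive_draft(segments)
```

Because the fallback still yields timestamped steps, frame alignment and DOCX assembly can run even with no network.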

Deploy to HuggingFace Spaces

This repo is Spaces-ready: the YAML header in this README, packages.txt (installs ffmpeg), and a single requirements.txt are all the platform needs. The app auto-detects Spaces (via SPACE_ID), binds to 0.0.0.0, and switches to multi-user-safe tokens: each visitor pastes their own HF token in the UI and it never touches the shared environment.

  1. Create a new Gradio Space at https://huggingface.co/new-space.
  2. Push this repo to it:
    git init && git add -A && git commit -m "DocuMaker"
    git remote add space https://huggingface.co/spaces/<your-username>/<space-name>
    git push space main
    
    (Or drag the files into the Space's Files tab.)
  3. The Space builds and starts automatically. Visitors paste their own HF token, so you don't expose yours or share its quota.

Notes for the free CPU tier: transcription and BLIP run on CPU (slower but fine for a demo); set DOCUMAKER_WHISPER_MODEL=base in the Space Settings → Variables for snappier transcription. You can push to both GitHub and the Space (add both as git remotes).

Configuration

All settings are environment variables (see .env.example). Highlights:

| Variable | Default | Purpose |
| --- | --- | --- |
| DOCUMAKER_LLM_MODEL | Qwen/Qwen2.5-7B-Instruct | Text LLM (any HF instruct model) |
| DOCUMAKER_VLM_MODEL | Qwen/Qwen2-VL-7B-Instruct | API vision model tried before local BLIP |
| DOCUMAKER_LOCAL_CAPTION_MODEL | Salesforce/blip-image-captioning-base | Local captioner |
| DOCUMAKER_ENABLE_VISION | 1 | Set 0 to skip captioning |
| DOCUMAKER_WHISPER_MODEL | small | tiny…large-v3 |
| DOCUMAKER_WHISPER_DEVICE | auto | auto / cuda / cpu |
| DOCUMAKER_SCENE_THRESHOLD | 27.0 | Lower = more auto-frames |
| DOCUMAKER_SHARE | 0 | 1 = Gradio public share link (enables multi-user-safe tokens) |
| DOCUMAKER_SERVER_NAME | 127.0.0.1 | Bind address; non-localhost enables multi-user-safe tokens |
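
For orientation, env-driven settings with these defaults could be read roughly like this. The Settings class and load_settings are illustrative names and cover only some of the variables; the real loader lives in src/config.py and may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    llm_model: str
    whisper_model: str
    whisper_device: str
    scene_threshold: float
    enable_vision: bool
    share: bool
    server_name: str


def load_settings(env=os.environ):
    """Build Settings from environment variables, using the documented defaults."""
    return Settings(
        llm_model=env.get("DOCUMAKER_LLM_MODEL", "Qwen/Qwen2.5-7B-Instruct"),
        whisper_model=env.get("DOCUMAKER_WHISPER_MODEL", "small"),
        whisper_device=env.get("DOCUMAKER_WHISPER_DEVICE", "auto"),
        scene_threshold=float(env.get("DOCUMAKER_SCENE_THRESHOLD", "27.0")),
        enable_vision=env.get("DOCUMAKER_ENABLE_VISION", "1") != "0",
        share=env.get("DOCUMAKER_SHARE", "0") == "1",
        server_name=env.get("DOCUMAKER_SERVER_NAME", "127.0.0.1"),
    )
```

For example, load_settings({"DOCUMAKER_WHISPER_MODEL": "base"}) overrides the Whisper size and leaves every other value at its default.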

Troubleshooting

  • Whisper CUDA errors / cuDNN not found: faster-whisper (CTranslate2) needs the NVIDIA CUDA libraries. Either install them (pip install nvidia-cublas-cu12 nvidia-cudnn-cu12) or force CPU with DOCUMAKER_WHISPER_DEVICE=cpu (slower but always works).
  • LLM call failed / model not available: free-tier model availability changes. Set DOCUMAKER_LLM_MODEL to another available instruct model, or pin a provider with DOCUMAKER_LLM_PROVIDER.
  • No captions in the DOCX: the API vision model usually isn't served on free HF accounts ("not supported by any provider you have enabled"), so DocuMaker uses local BLIP (installed via requirements.txt). Frame selection and images still work without it.
  • Video doesn't preview: the app serves files from work/ via Gradio's allowed_paths (URL prefix /gradio_api/file=). Make sure the upload completed; very large files take a moment.
  • No images in the DOCX: capture or auto-extract frames before Build DOCX. Steps with a timestamp but no nearby frame pull a fresh frame from the video.
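
The CUDA-to-CPU fallback behind the first item above can be approximated as follows. This is a sketch, not the code in src/transcribe.py: load_whisper and its factory parameter are hypothetical, and the compute types are one plausible choice. WhisperModel(model_size, device=..., compute_type=...) is the real faster-whisper constructor. Some CUDA problems only surface during transcription rather than at load time, which this sketch does not cover.

```python
def load_whisper(model_size="small", device="auto", factory=None):
    """Return (model, device_used), trying CUDA before CPU when device is "auto"."""
    if factory is None:
        # Real class; requires the faster-whisper package to be installed.
        from faster_whisper import WhisperModel
        factory = WhisperModel
    candidates = ["cuda", "cpu"] if device == "auto" else [device]
    last_error = None
    for dev in candidates:
        # float16 on GPU and int8 on CPU are assumptions, not project settings.
        compute_type = "float16" if dev == "cuda" else "int8"
        try:
            return factory(model_size, device=dev, compute_type=compute_type), dev
        except Exception as exc:  # e.g. missing cuBLAS/cuDNN libraries
            last_error = exc
    raise RuntimeError(f"could not load Whisper on {candidates}") from last_error
```

Setting DOCUMAKER_WHISPER_DEVICE=cpu corresponds to calling this with device="cpu", which skips the CUDA attempt entirely.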

Project layout

app.py                 Gradio UI + event wiring
src/config.py          env-driven settings
src/video.py           ffmpeg audio extract / duration / frame@timestamp
src/frames.py          scene detection, dedup, manual-capture decode
src/transcribe.py      faster-whisper (CUDA→CPU fallback)
src/llm.py             HF Inference: transcript → structured step JSON
src/vision.py          VLM captioning (HF API) + local BLIP fallback
src/guide.py           align frames↔steps, caption
src/docx_export.py     python-docx assembly
src/web/player.html    custom HTML5 player for seek + snapshot
scripts/make_sample.py synthesize a test clip
scripts/smoke_test.py  headless end-to-end check