---
title: DocuMaker
emoji: 🎬
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.18.0
app_file: app.py
python_version: "3.11"
pinned: false
license: mit
short_description: Turn a tutorial video into a step-by-step DOCX guide
---

# DocuMaker

Turn a tutorial/screencast **video** into a polished **step-by-step `.docx` guide** with screenshots, using open-source tools and **free** HuggingFace models.

> The block above is HuggingFace Spaces config. On GitHub it renders as a small
> table; on Spaces it tells the platform how to run the app.

Pipeline: `video → preview → frames (manual + automatic) → audio transcription (Whisper) → LLM cleanup & step-structuring → image/step alignment + captions → DOCX export`.

- **UI:** a local Gradio app.
- **Transcription:** runs **locally** with [faster-whisper](https://github.com/SYSTRAN/faster-whisper) (GPU with automatic CPU fallback).
- **Guide writing:** the **HuggingFace Inference API** (uses your HF token).
- **Image captions:** a vision model. It tries an API vision-chat model first, then falls back to a **local BLIP** captioner, which is the path most free HF accounts end up using, since few enabled providers serve a vision model on the free tier.
- **Frames:** one-click manual snapshots **and** automatic scene-detection (PySceneDetect) de-duplicated with perceptual hashing.

### How a frame is chosen for each step

Relevancy is decided by combining accurate signals (no single model "judges" it):

1. **Timestamp alignment:** the frame on screen while that step was narrated (Whisper timestamps ↔ step time). The strongest signal for tutorials.
2. **Sharpness:** variance-of-Laplacian, to avoid blurry scene-transition frames.
3. **BLIP caption match:** the BLIP caption is compared to the step's text; a frame whose description overlaps the step gets a nudge. This *suggests*, it doesn't decide.
4. **Manual preference:** frames you snapshot yourself win ties.

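
The four signals above can be combined roughly as follows. This is a minimal sketch, not the project's actual `_pick_frame`: the `Frame` class, the function names, the scoring formula, and the weights (0.6 / 0.25 / 0.15) are all assumptions made for illustration. Sharpness is taken as a precomputed number so the example needs only the standard library.

```python
import re
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical frame record; field names are illustrative only."""
    timestamp: float       # seconds into the video
    sharpness: float       # e.g. a precomputed variance-of-Laplacian value
    caption: str = ""      # BLIP caption, empty when captioning is off
    manual: bool = False   # True for frames the user snapshotted


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def caption_overlap(caption: str, step_text: str) -> float:
    """Jaccard overlap (0..1) between caption words and step words."""
    a, b = _words(caption), _words(step_text)
    return len(a & b) / len(a | b) if a and b else 0.0


def score_frame(frame, step_start, step_end, step_text, max_sharpness):
    # 1. Timestamp alignment: full score inside the step's narration window,
    #    decaying with distance outside it.
    if step_start <= frame.timestamp <= step_end:
        time_score = 1.0
    else:
        gap = min(abs(frame.timestamp - step_start),
                  abs(frame.timestamp - step_end))
        time_score = 1.0 / (1.0 + gap)
    # 2. Sharpness, normalized against the sharpest candidate.
    sharp_score = frame.sharpness / max_sharpness if max_sharpness > 0 else 0.0
    # 3. Caption match: a small nudge, never the deciding factor on its own.
    cap_score = caption_overlap(frame.caption, step_text)
    # Assumed weights: timestamp dominates, caption only nudges.
    return 0.6 * time_score + 0.25 * sharp_score + 0.15 * cap_score


def pick_frame(frames, step_start, step_end, step_text):
    """Return the best frame for a step, or None if there are no frames."""
    if not frames:
        return None
    max_sharpness = max(f.sharpness for f in frames)
    # 4. Manual preference: the boolean breaks ties between equal scores.
    return max(
        frames,
        key=lambda f: (
            score_frame(f, step_start, step_end, step_text, max_sharpness),
            f.manual,
        ),
    )
```

For a step narrated between 4 s and 8 s, a frame captured at 5 s outranks a sharper frame from 30 s because the timestamp term carries the largest weight. When captions are unavailable, the caption term is zero for every frame and the ranking falls back to timestamp and sharpness.
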
See `_pick_frame` in [src/guide.py](src/guide.py).

## Requirements

- Python 3.11
- **ffmpeg** on your `PATH` (`ffmpeg -version` should work)
- A HuggingFace token from your HuggingFace account settings (a free **Read** token works). You **paste it into the app's UI**; it is not read from the environment. (The headless smoke test reads it from `HF_TOKEN` / `HUGGINGFACEHUB_API_TOKEN` instead.)
- A CUDA GPU is optional (Whisper falls back to CPU automatically)

## Setup

```bash
python -m venv .venv
# Windows (PowerShell):  .venv\Scripts\Activate.ps1
# Git Bash:              source .venv/Scripts/activate
pip install -r requirements.txt
cp .env.example .env   # optional: tweak model ids / Whisper size
```

`requirements.txt` includes the local **BLIP** captioner (`torch` + `transformers`, CPU build). The BLIP model (~1 GB) downloads on first use and is cached. Captions are nice-to-have: without a captioner you still get images + step text, and frame selection uses timestamp + sharpness only.

To run BLIP on a local **NVIDIA GPU**, reinstall a CUDA build of torch; it's used automatically when available:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu124
```

### HuggingFace token

The **app takes your token from the UI**: paste it into the "🔑 HuggingFace token" box at the top of the page. When you enter it (local single-user mode), the app sets it as the process `HF_TOKEN` (in memory only, never written to disk) so every HuggingFace operation (the guide LLM and model downloads) authenticates with it, overriding any stale token already in your environment.

**Shared / multi-user deployments:** if you launch with `DOCUMAKER_SHARE=1` or a non-localhost `DOCUMAKER_SERVER_NAME`, the app switches to per-session tokens and **does not** touch the global `HF_TOKEN`, so one user's token can never leak to another. The token is still threaded directly to that user's LLM/caption calls (the guide LLM clients are created per request, never cached across sessions).

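
The two token modes described above could be decided along these lines. This is a sketch under stated assumptions rather than the app's real code: `is_shared_mode`, `register_token`, and the set of names treated as localhost are hypothetical. Only the environment variable names (`DOCUMAKER_SHARE`, `DOCUMAKER_SERVER_NAME`, `SPACE_ID`, `HF_TOKEN`) come from this README.

```python
import os
from collections.abc import MutableMapping

# Assumption: which bind addresses count as "localhost" is a guess.
LOCALHOST_NAMES = {"127.0.0.1", "localhost", "::1"}


def is_shared_mode(env: MutableMapping[str, str]) -> bool:
    """True when more than one user may reach this process."""
    if env.get("DOCUMAKER_SHARE", "0") == "1":
        return True          # Gradio public share link
    if env.get("SPACE_ID"):
        return True          # running on HuggingFace Spaces
    server = env.get("DOCUMAKER_SERVER_NAME", "127.0.0.1")
    return server not in LOCALHOST_NAMES


def register_token(token: str, env: MutableMapping[str, str] = os.environ) -> str:
    """Return the token to pass to each API call.

    Local single-user mode also exports it as HF_TOKEN (in memory only).
    Shared mode leaves the global environment untouched.
    """
    token = token.strip()
    if not is_shared_mode(env):
        env["HF_TOKEN"] = token
    return token
```

Passing the environment as a parameter keeps the decision testable with a plain `dict`; in shared mode the returned token is the only copy and travels with the request that supplied it.
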
The **headless smoke test** (`scripts/smoke_test.py`) instead reads the token from `HF_TOKEN` / `HUGGINGFACEHUB_API_TOKEN` and validates which one actually authenticates (handy if one is stale).

## Run

```bash
python app.py
```

Open the printed local URL, then:

1. **Paste your HuggingFace token** into the 🔑 box at the top.
2. **Upload & preview** a video.
3. **Capture frames:** scrub the seek bar and click *📸 Capture current frame*, and/or click *✨ Auto-extract frames* for scene-based snapshots.
4. **Transcribe audio** (Whisper, local). Edit the transcript if you like.
5. **Generate step-by-step guide** (HF LLM) and review the steps.
6. **Build DOCX:** images are matched to steps and captioned; then download `guide.docx`.

## Quick backend check (no UI)

```bash
python scripts/make_sample.py   # synthesizes work/sample/sample.mp4 (4 scenes + narration)
python scripts/smoke_test.py    # runs the full pipeline and asserts a valid DOCX
```

The smoke test falls back to a naive draft if the HF API is unreachable, so DOCX assembly is always exercised.

## Deploy to HuggingFace Spaces

This repo is Spaces-ready: the YAML header in this README, `packages.txt` (installs `ffmpeg`), and a single `requirements.txt` are all the platform needs. The app auto-detects Spaces (via `SPACE_ID`), binds to `0.0.0.0`, and switches to **multi-user-safe tokens**: each visitor pastes their own HF token in the UI and it never touches the shared environment.

1. Create a new **Gradio** Space on HuggingFace.
2. Push this repo to it:
   ```bash
   git init && git add -A && git commit -m "DocuMaker"
   git remote add space https://huggingface.co/spaces//
   git push space main
   ```
   (Or drag the files into the Space's *Files* tab.)
3. The Space builds and starts automatically. Visitors paste their **own** HF token, so you don't expose yours or share its quota.

Notes for the free CPU tier: transcription and BLIP run on CPU (slower but fine for a demo); set `DOCUMAKER_WHISPER_MODEL=base` in the Space *Settings → Variables* for snappier transcription.

You can push to **both** GitHub and the Space (add both as git remotes).

## Configuration

All settings are environment variables (see [.env.example](.env.example)). Highlights:

| Variable | Default | Purpose |
|---|---|---|
| `DOCUMAKER_LLM_MODEL` | `Qwen/Qwen2.5-7B-Instruct` | Text LLM (any HF instruct model) |
| `DOCUMAKER_VLM_MODEL` | `Qwen/Qwen2-VL-7B-Instruct` | API vision model tried before local BLIP |
| `DOCUMAKER_LOCAL_CAPTION_MODEL` | `Salesforce/blip-image-captioning-base` | Local captioner |
| `DOCUMAKER_ENABLE_VISION` | `1` | Set `0` to skip captioning |
| `DOCUMAKER_WHISPER_MODEL` | `small` | `tiny`…`large-v3` |
| `DOCUMAKER_WHISPER_DEVICE` | `auto` | `auto` / `cuda` / `cpu` |
| `DOCUMAKER_SCENE_THRESHOLD` | `27.0` | Lower = more auto-frames |
| `DOCUMAKER_SHARE` | `0` | `1` = Gradio public share link (enables multi-user-safe tokens) |
| `DOCUMAKER_SERVER_NAME` | `127.0.0.1` | Bind address; non-localhost enables multi-user-safe tokens |

## Troubleshooting

- **Whisper CUDA errors / cuDNN not found:** faster-whisper (CTranslate2) needs the NVIDIA CUDA libraries. Either install them (`pip install nvidia-cublas-cu12 nvidia-cudnn-cu12`) or force CPU with `DOCUMAKER_WHISPER_DEVICE=cpu` (slower but always works).
- **LLM call failed / model not available:** free-tier model availability changes. Set `DOCUMAKER_LLM_MODEL` to another available instruct model, or pin a provider with `DOCUMAKER_LLM_PROVIDER`.
- **No captions in the DOCX:** the API vision model usually isn't served on free HF accounts ("not supported by any provider you have enabled"), so DocuMaker uses local BLIP (installed via `requirements.txt`). Frame *selection* and images still work without it.
- **Video doesn't preview:** the app serves files from `work/` via Gradio's `allowed_paths` (URL prefix `/gradio_api/file=`). Make sure the upload completed; very large files take a moment.
- **No images in the DOCX:** capture or auto-extract frames before *Build DOCX*. Steps with a timestamp but no nearby frame pull a fresh frame from the video.

## Project layout

```
app.py                    Gradio UI + event wiring
src/config.py             env-driven settings
src/video.py              ffmpeg audio extract / duration / frame@timestamp
src/frames.py             scene detection, dedup, manual-capture decode
src/transcribe.py         faster-whisper (CUDA→CPU fallback)
src/llm.py                HF Inference: transcript → structured step JSON
src/vision.py             VLM captioning (HF API) + local BLIP fallback
src/guide.py              align frames↔steps, caption
src/docx_export.py        python-docx assembly
src/web/player.html       custom HTML5 player for seek + snapshot
scripts/make_sample.py    synthesize a test clip
scripts/smoke_test.py     headless end-to-end check
```