
vivekchakraverty

DocuMaker: video to step-by-step DOCX guide (Whisper + HF LLM + BLIP)

85b485a 19 days ago


	---
	title: DocuMaker
	emoji: 🎬
	colorFrom: indigo
	colorTo: purple
	sdk: gradio
	sdk_version: 6.18.0
	app_file: app.py
	python_version: "3.11"
	pinned: false
	license: mit
	short_description: Turn a tutorial video into a step-by-step DOCX guide
	---

	# DocuMaker

	Turn a tutorial/screencast video into a polished step-by-step `.docx` guide
	with screenshots — using open-source tools and free HuggingFace models.

	> The block above is HuggingFace Spaces config. On GitHub it renders as a small
	> table; on Spaces it tells the platform how to run the app.

	Pipeline: `video → preview → frames (manual + automatic) → audio transcription
	(Whisper) → LLM cleanup & step-structuring → image/step alignment + captions →
	DOCX export`.

	- UI: a local Gradio app.
	- Transcription: runs locally with [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
	(GPU with automatic CPU fallback).
	- Guide writing: the HuggingFace Inference API (uses your HF token).
	- Image captions: a vision model. It tries an API vision-chat model first, then
	falls back to a local BLIP captioner — which is the path most free HF accounts
	end up using, since few enabled providers serve a vision model on the free tier.
	- Frames: one-click manual snapshots and automatic scene-detection
	(PySceneDetect) de-duplicated with perceptual hashing.
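
The de-duplication idea in the last bullet can be sketched with a simple average hash. This is a minimal illustration, not DocuMaker's code: it assumes each frame has already been downscaled to a small grayscale grid, and the names `average_hash`, `hamming`, `dedupe` and the `max_distance=5` threshold are hypothetical.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[int]]  # small grayscale thumbnail, e.g. 8x8 values in 0..255


def average_hash(grid: Grid) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    pixels = [value for row in grid for value in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedupe(grids: List[Grid], max_distance: int = 5) -> List[int]:
    """Return indices of frames to keep, dropping near-duplicates of kept frames."""
    kept: List[int] = []
    kept_hashes: List[int] = []
    for index, grid in enumerate(grids):
        h = average_hash(grid)
        # Keep the frame only if it differs enough from every frame kept so far.
        if all(hamming(h, other) > max_distance for other in kept_hashes):
            kept.append(index)
            kept_hashes.append(h)
    return kept
```

Two frames that differ only by slight brightness changes hash to the same bits and collapse into one; a genuinely different screen flips many bits and survives.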

	### How a frame is chosen for each step

	Relevance is decided by combining several signals; no single model "judges" it:

	1. Timestamp alignment: the frame on screen while that step was narrated
	(Whisper timestamps ↔ step time). The strongest signal for tutorials.
	2. Sharpness: variance-of-Laplacian, to avoid blurry scene-transition frames.
	3. BLIP caption match: the BLIP caption is compared to the step's text; a frame
	whose description overlaps the step gets a nudge. This signal suggests; it doesn't decide.
	4. Manual preference: frames you snapshot yourself win ties.

	See `_pick_frame` in [src/guide.py](src/guide.py).
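
To make the combination concrete, here is a rough sketch of how the four signals could be folded into one score. It is not the actual `_pick_frame`: the `Frame` fields, the Jaccard word overlap, and the 0.6 / 0.3 / 0.1 weights are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    timestamp: float       # seconds into the video
    sharpness: float       # e.g. variance of the Laplacian, higher = sharper
    caption: str = ""      # BLIP caption, empty if captioning is off
    manual: bool = False   # True for frames the user captured by hand


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def caption_overlap(caption: str, step_text: str) -> float:
    """Jaccard overlap between caption words and step words (0..1)."""
    a, b = _tokens(caption), _tokens(step_text)
    return len(a & b) / len(a | b) if a and b else 0.0


def score_frame(frame: Frame, step_start: float, step_end: float,
                step_text: str, max_sharpness: float) -> float:
    # 1. Timestamp alignment: 1.0 inside the step window, decaying with distance.
    if step_start <= frame.timestamp <= step_end:
        time_score = 1.0
    else:
        gap = min(abs(frame.timestamp - step_start), abs(frame.timestamp - step_end))
        time_score = 1.0 / (1.0 + gap)
    # 2. Sharpness, normalised against the sharpest candidate.
    sharp_score = frame.sharpness / max_sharpness if max_sharpness > 0 else 0.0
    # 3. Caption match is only a nudge, so it gets the smallest weight.
    nudge = caption_overlap(frame.caption, step_text)
    # 4. Manual frames get a tiny bonus so they win ties.
    bonus = 0.01 if frame.manual else 0.0
    # Hypothetical weights, chosen only for this sketch.
    return 0.6 * time_score + 0.3 * sharp_score + 0.1 * nudge + bonus


def pick_frame(frames: List[Frame], step_start: float, step_end: float,
               step_text: str) -> Optional[Frame]:
    """Return the highest-scoring frame, or None when there are no candidates."""
    if not frames:
        return None
    max_sharpness = max(f.sharpness for f in frames)
    return max(frames, key=lambda f: score_frame(
        f, step_start, step_end, step_text, max_sharpness))
```

Because the timestamp term dominates, a sharp frame from a different part of the video cannot outrank a frame that was on screen during the step.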

	## Requirements

	- Python 3.11
	- ffmpeg on your `PATH` (`ffmpeg -version` should work)
	- A HuggingFace token — get one at <https://huggingface.co/settings/tokens>
	(a free Read token works). You paste it into the app's UI; it is not read
	from the environment. (The headless smoke test reads it from `HF_TOKEN` /
	`HUGGINGFACEHUB_API_TOKEN` instead.)
	- A CUDA GPU is optional (Whisper falls back to CPU automatically)
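
If you want to verify the first two requirements from Python before launching, a small preflight helper along these lines would do. It is a hypothetical helper, not part of the repo; `shutil.which` is the standard-library way to look a program up on `PATH`.

```python
import shutil
import sys
from typing import Callable, List, Optional, Tuple


def preflight(required_python: Tuple[int, int] = (3, 11),
              which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return a list of problems; an empty list means the basics are in place."""
    problems: List[str] = []
    current = tuple(sys.version_info[:2])
    if current != required_python:
        problems.append(
            f"Python {required_python[0]}.{required_python[1]} expected, "
            f"found {current[0]}.{current[1]}")
    # `which` returns None when the executable is not on PATH.
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems
```

Calling `preflight()` with no arguments checks the running interpreter and the real `PATH`; the `which` parameter exists only so the check can be exercised without ffmpeg installed.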

	## Setup

	```bash
	python -m venv .venv
	source .venv/bin/activate # macOS / Linux
	# Windows (PowerShell): .venv\Scripts\Activate.ps1
	# Git Bash: source .venv/Scripts/activate
	pip install -r requirements.txt

	cp .env.example .env # optional — tweak model ids / Whisper size
	```

	`requirements.txt` includes the local BLIP captioner (`torch` + `transformers`,
	CPU build). The BLIP model (~1 GB) downloads on first use and is cached. Captions
	are nice-to-have: without a captioner you still get images + step text, and frame
	selection uses timestamp + sharpness only.

	To run BLIP on a local NVIDIA GPU, reinstall a CUDA build of torch — it's used
	automatically when available:

	```bash
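	# --force-reinstall replaces the CPU build that requirements.txt already installed
	pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/cu124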
	```

	### HuggingFace token

	The app takes your token from the UI — paste it into the "🔑 HuggingFace token"
	box at the top of the page. When you enter it (local single-user mode), the app
	sets it as the process `HF_TOKEN` (in memory only, never written to disk) so every
	HuggingFace operation — the guide LLM and model downloads — authenticates with it,
	overriding any stale token already in your environment.

	Shared / multi-user deployments: if you launch with `DOCUMAKER_SHARE=1` or a
	non-localhost `DOCUMAKER_SERVER_NAME`, the app switches to per-session tokens and
	does not touch the global `HF_TOKEN` — so one user's token can never leak to
	another. The token is still threaded directly to that user's LLM/caption calls (the
	guide LLM clients are created per request, never cached across sessions).

	The headless smoke test (`scripts/smoke_test.py`) instead reads the token from
	`HF_TOKEN` / `HUGGINGFACEHUB_API_TOKEN` and validates which one actually
	authenticates (handy if one is stale).
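
The two token modes can be summarised in a few lines. This is a sketch of the behaviour described above, not the app's code: `is_multi_user`, `apply_token`, and the exact set of names treated as localhost are assumptions.

```python
from typing import Mapping, MutableMapping, Optional

# Assumption: these are the bind addresses treated as "local only".
LOCALHOST_NAMES = {"127.0.0.1", "localhost", "::1"}


def is_multi_user(env: Mapping[str, str]) -> bool:
    """True when the app may be serving more than one person."""
    if env.get("SPACE_ID"):                      # running on HuggingFace Spaces
        return True
    if env.get("DOCUMAKER_SHARE", "0") == "1":   # Gradio public share link
        return True
    server_name = env.get("DOCUMAKER_SERVER_NAME", "127.0.0.1")
    return server_name not in LOCALHOST_NAMES


def apply_token(token: str, env: MutableMapping[str, str]) -> Optional[str]:
    """Single-user: store the token process-wide and return None.

    Multi-user: leave the environment untouched and return the token so the
    caller can pass it explicitly to each request.
    """
    if is_multi_user(env):
        return token
    env["HF_TOKEN"] = token   # in memory only, never written to disk
    return None
```

In a real process you would pass `os.environ` as `env`; a plain dict works for experimenting.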

	## Run

	```bash
	python app.py
	```

	Open the printed local URL, then:

	1. Paste your HuggingFace token into the 🔑 box at the top.
	2. Upload & preview a video.
	3. Capture frames — scrub the seek bar and click 📸 Capture current frame,
	and/or click ✨ Auto-extract frames for scene-based snapshots.
	4. Transcribe audio (Whisper, local). Edit the transcript if you like.
	5. Generate step-by-step guide (HF LLM) and review the steps.
	6. Build DOCX — images are matched to steps and captioned, then download `guide.docx`.

	## Quick backend check (no UI)

	```bash
	python scripts/make_sample.py # synthesizes work/sample/sample.mp4 (4 scenes + narration)
	python scripts/smoke_test.py # runs the full pipeline and asserts a valid DOCX
	```

	The smoke test falls back to a naive draft if the HF API is unreachable, so DOCX
	assembly is always exercised.
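
The README does not spell out what the naive draft looks like, so the following is only one plausible shape for the fallback: one step per transcript sentence. The `Step` dict layout and both function names are hypothetical.

```python
import re
from typing import Callable, Dict, List

Step = Dict[str, str]


def naive_draft(transcript: str) -> List[Step]:
    """One step per sentence; no rewriting, no merging."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", transcript)
                 if s.strip()]
    return [{"title": f"Step {i}", "text": s}
            for i, s in enumerate(sentences, start=1)]


def draft_steps(transcript: str, llm: Callable[[str], List[Step]]) -> List[Step]:
    """Try the LLM first; on any failure fall back to the naive draft."""
    try:
        steps = llm(transcript)
        if steps:
            return steps
    except Exception:
        pass  # unreachable API, quota error, malformed response, ...
    return naive_draft(transcript)
```

The point of the fallback is that everything downstream (frame matching, DOCX assembly) still receives a list of steps, even when the network is down.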

	## Deploy to HuggingFace Spaces

	This repo is Spaces-ready: the YAML header in this README, `packages.txt` (installs
	`ffmpeg`), and a single `requirements.txt` are all the platform needs. The app
	auto-detects Spaces (via `SPACE_ID`), binds to `0.0.0.0`, and switches to
	multi-user-safe tokens — each visitor pastes their own HF token in the UI and it
	never touches the shared environment.

	1. Create a new Gradio Space at <https://huggingface.co/new-space>.
	2. Push this repo to it:
	```bash
	git init && git add -A && git commit -m "DocuMaker"
	git branch -M main # make sure the local branch is named main before pushing
	git remote add space https://huggingface.co/spaces/<your-username>/<space-name>
	git push space main
	```
	(Or drag the files into the Space's Files tab.)
	3. The Space builds and starts automatically. Visitors paste their own HF token —
	you don't expose yours or share its quota.

	Notes for the free CPU tier: transcription and BLIP run on CPU (slower but fine for a
	demo); set `DOCUMAKER_WHISPER_MODEL=base` in the Space Settings → Variables for
	snappier transcription. You can push to both GitHub and the Space (add both as
	git remotes).

	## Configuration

	All settings are environment variables (see [.env.example](.env.example)). Highlights:

	| Variable | Default | Purpose |
	|---|---|---|
	| `DOCUMAKER_LLM_MODEL` | `Qwen/Qwen2.5-7B-Instruct` | Text LLM (any HF instruct model) |
	| `DOCUMAKER_VLM_MODEL` | `Qwen/Qwen2-VL-7B-Instruct` | API vision model tried before local BLIP |
	| `DOCUMAKER_LOCAL_CAPTION_MODEL` | `Salesforce/blip-image-captioning-base` | Local captioner |
	| `DOCUMAKER_ENABLE_VISION` | `1` | Set `0` to skip captioning |
	| `DOCUMAKER_WHISPER_MODEL` | `small` | `tiny`…`large-v3` |
	| `DOCUMAKER_WHISPER_DEVICE` | `auto` | `auto` / `cuda` / `cpu` |
	| `DOCUMAKER_SCENE_THRESHOLD` | `27.0` | Lower = more auto-frames |
	| `DOCUMAKER_SHARE` | `0` | `1` = Gradio public share link (enables multi-user-safe tokens) |
	| `DOCUMAKER_SERVER_NAME` | `127.0.0.1` | Bind address; non-localhost enables multi-user-safe tokens |
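
For orientation, an env-driven settings object built from the table above might look like the sketch below. The variable names and defaults come from the table; the `Settings` class, its field names, and `load_settings` are assumptions and may not match `src/config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    llm_model: str
    vlm_model: str
    local_caption_model: str
    enable_vision: bool
    whisper_model: str
    whisper_device: str
    scene_threshold: float
    share: bool
    server_name: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read DOCUMAKER_* variables, falling back to the documented defaults."""
    def get(name: str, default: str) -> str:
        return env.get(f"DOCUMAKER_{name}", default)

    return Settings(
        llm_model=get("LLM_MODEL", "Qwen/Qwen2.5-7B-Instruct"),
        vlm_model=get("VLM_MODEL", "Qwen/Qwen2-VL-7B-Instruct"),
        local_caption_model=get("LOCAL_CAPTION_MODEL",
                                "Salesforce/blip-image-captioning-base"),
        enable_vision=get("ENABLE_VISION", "1") != "0",
        whisper_model=get("WHISPER_MODEL", "small"),
        whisper_device=get("WHISPER_DEVICE", "auto"),
        scene_threshold=float(get("SCENE_THRESHOLD", "27.0")),
        share=get("SHARE", "0") == "1",
        server_name=get("SERVER_NAME", "127.0.0.1"),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check how an override such as `DOCUMAKER_WHISPER_MODEL=base` is picked up.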

	## Troubleshooting

	- Whisper CUDA errors / cuDNN not found: faster-whisper (CTranslate2) needs the
	NVIDIA CUDA libraries. Either install them —
	`pip install nvidia-cublas-cu12 nvidia-cudnn-cu12` — or force CPU with
	`DOCUMAKER_WHISPER_DEVICE=cpu` (slower but always works).
	- LLM call failed / model not available: free-tier model availability changes.
	Set `DOCUMAKER_LLM_MODEL` to another available instruct model, or pin a provider
	with `DOCUMAKER_LLM_PROVIDER`.
	- No captions in the DOCX: the API vision model usually isn't served on free HF
	accounts ("not supported by any provider you have enabled"), so DocuMaker uses
	local BLIP (installed via `requirements.txt`). Frame selection and images still
	work without it.
	- Video doesn't preview: the app serves files from `work/` via Gradio's
	`allowed_paths` (URL prefix `/gradio_api/file=`). Make sure the upload completed;
	very large files take a moment.
	- No images in the DOCX: capture or auto-extract frames before Build DOCX.
	Steps with a timestamp but no nearby frame pull a fresh frame from the video.
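
The CUDA-to-CPU fallback mentioned in the first bullet can be approximated as follows. This is a sketch, not the code in `src/transcribe.py`: `load_whisper` and its `factory` parameter are hypothetical, and the `float16` / `int8` compute types are common choices rather than confirmed project settings. Some CUDA library errors may only surface during the first transcription, which this load-time sketch does not cover.

```python
from typing import Any, Callable, Optional


def load_whisper(model_size: str = "small", device: str = "auto",
                 factory: Optional[Callable[..., Any]] = None) -> Any:
    """Try CUDA first when device is 'auto', then fall back to CPU."""
    if factory is None:
        # Imported lazily so the sketch can be read without faster-whisper installed.
        from faster_whisper import WhisperModel
        factory = WhisperModel

    attempts = []
    if device in ("auto", "cuda"):
        attempts.append(("cuda", "float16"))
    if device in ("auto", "cpu"):
        attempts.append(("cpu", "int8"))

    last_error: Optional[Exception] = None
    for dev, compute_type in attempts:
        try:
            return factory(model_size, device=dev, compute_type=compute_type)
        except Exception as exc:   # e.g. missing cuBLAS / cuDNN libraries
            last_error = exc
    raise RuntimeError(f"could not load Whisper on {device!r}") from last_error
```

Setting `DOCUMAKER_WHISPER_DEVICE=cpu` corresponds to calling this with `device="cpu"`, which skips the CUDA attempt entirely.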

	## Project layout

	```
	app.py                  Gradio UI + event wiring
	src/config.py           env-driven settings
	src/video.py            ffmpeg audio extract / duration / frame@timestamp
	src/frames.py           scene detection, dedup, manual-capture decode
	src/transcribe.py       faster-whisper (CUDA→CPU fallback)
	src/llm.py              HF Inference: transcript → structured step JSON
	src/vision.py           VLM captioning (HF API) + local BLIP fallback
	src/guide.py            align frames↔steps, caption
	src/docx_export.py      python-docx assembly
	src/web/player.html     custom HTML5 player for seek + snapshot
	scripts/make_sample.py  synthesize a test clip
	scripts/smoke_test.py   headless end-to-end check
	```
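
As a rough idea of what the ffmpeg helpers in `src/video.py` involve, the two calls below build command lines for audio extraction and for grabbing a frame at a timestamp. The function names and the 16 kHz mono output are assumptions; the ffmpeg flags themselves (`-ss`, `-vn`, `-ac`, `-ar`, `-frames:v`, `-y`) are standard.

```python
import subprocess
from typing import List


def audio_extract_cmd(video: str, wav_out: str) -> List[str]:
    """Drop the video stream and write 16 kHz mono audio."""
    return ["ffmpeg", "-y", "-i", video, "-vn", "-ac", "1", "-ar", "16000", wav_out]


def frame_at_cmd(video: str, seconds: float, image_out: str) -> List[str]:
    """Seek to the timestamp and write exactly one video frame."""
    return ["ffmpeg", "-y", "-ss", f"{seconds:.3f}", "-i", video,
            "-frames:v", "1", image_out]


def run(cmd: List[str]) -> None:
    """Run ffmpeg and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True, capture_output=True)
```

Building the argument list separately from running it keeps the commands easy to inspect, and passing a list (rather than a shell string) avoids quoting problems with file names that contain spaces.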