	---
	title: Physically Based Portrait Mode Engine
	emoji: 🐠
	colorFrom: gray
	colorTo: blue
	sdk: gradio
	sdk_version: 6.19.0
	python_version: '3.13'
	app_file: app.py
	pinned: false
	short_description: An improved portrait mode renderer for pictures.
	---

	# Physically-Based Portrait Mode Engine

	A Gradio demo that simulates shallow depth-of-field (portrait mode) on ordinary photos. Instead of applying a flat Gaussian blur to the background, it estimates scene depth, builds a circle-of-confusion (CoC) map, and uses a trained neural renderer (RendererNet) to produce physically motivated defocus. In-focus regions are preserved by blending the render back onto the original image.

	The app is designed to run as a [Hugging Face Space](https://huggingface.co/docs/hub/spaces) and is self-contained: model definitions and inference code live in a single `app.py` file.

	## What it does

	Given an input photograph and camera parameters (f-stop, focal length), the pipeline:

	1. Estimates relative depth with [Depth Anything V2](https://huggingface.co/depth-anything/Depth-Anything-V2-Base-hf) via the Hugging Face `transformers` library.
	2. Builds a pseudo CoC map by measuring how far each pixel's depth deviates from the depth at the image center (the assumed focus point). Pixels closer to the focus depth get a CoC near zero; pixels farther away get larger CoC values.
	3. Renders defocus with RendererNet, a U-Net that takes the RGB image plus normalized f-stop, focal length, and CoC channels and outputs a blurred RGB image at the chosen aperture settings.
	4. Blends the NN render onto the original using a smooth CoC-based weight so that in-focus areas (CoC below a threshold) remain untouched.
	5. Produces a non-NN baseline for comparison: a flat Gaussian blur applied only where CoC exceeds a separate threshold.

	Three outputs are shown:

	| Output | Description |
	|--------|-------------|
	| Rendered (NN) | RendererNet defocus, blended with the original in in-focus regions |
	| Gaussian baseline | Simple background blur for comparison against the learned renderer |
	| Pseudo CoC map | Colorized circle-of-confusion visualization (inferno colormap) |
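
The Gaussian baseline from step 5 can be sketched in NumPy as below. This is a minimal illustration, not the code in `app.py`: the helper names (`gaussian_kernel`, `gaussian_blur`, `gaussian_baseline`), the 3-sigma kernel truncation, and the edge padding are all assumptions. The default threshold and sigma mirror the benchmark setup described later in this README (CoC > 1 px, σ = 12 px).

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """1-D Gaussian kernel, truncated at 3 sigma (assumption), summing to 1."""
    radius = max(1, int(np.ceil(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of an (H, W, 3) float image with edge padding."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2

    def blur_1d(values: np.ndarray) -> np.ndarray:
        padded = np.pad(values, radius, mode="edge")
        return np.convolve(padded, kernel, mode="valid")

    # Blur along rows (axis 0), then along columns (axis 1).
    out = np.apply_along_axis(blur_1d, 0, image.astype(np.float64))
    return np.apply_along_axis(blur_1d, 1, out)


def gaussian_baseline(image, coc, coc_threshold=1.0, sigma=12.0):
    """Paste a uniformly blurred copy wherever CoC exceeds the threshold."""
    blurred = gaussian_blur(image, sigma)
    hard_mask = coc > coc_threshold  # hard cut, no feathering
    return np.where(hard_mask[..., None], blurred, image)
```

Because the mask is binary, the transition between sharp and blurred pixels is a hard cut, which is the behavior the learned renderer is compared against.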

	## Pipeline overview

	```
	Input image
	    │
	    ▼
	Resize (longest side ≤ 1024 px)
	    │
	    ▼
	Depth Anything V2 ──► relative depth [0, 1]
	    │
	    ▼
	Pseudo CoC map (focus = image center, max 4 px)
	    │
	    ├──────────────────────────────┐
	    ▼                              ▼
	RendererNet (512×512)          Gaussian blur baseline
	f-stop + focal length + CoC    (CoC > threshold)
	    │                              │
	    ▼                              │
	CoC-weighted blend ◄───────────────┘
	    │
	    ▼
	Final outputs (NN render, baseline, CoC map)
	```

	### Pseudo CoC

	Because the demo has no interactive focus point, focus is fixed at the center of the image (`h // 2`, `w // 2`). The CoC at each pixel is proportional to the absolute difference between that pixel's relative depth and the depth at the focus point, scaled to a maximum of 4 px (`COC_MAX_PX`).
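
A minimal NumPy sketch of this idea follows. It is not the code in `app.py`: the function name `pseudo_coc` is hypothetical, and scaling so that the largest depth deviation in the image maps to `COC_MAX_PX` is an assumption about how the "maximum of 4 px" is reached.

```python
import numpy as np

COC_MAX_PX = 4.0  # value listed in the internal constants table


def pseudo_coc(depth: np.ndarray, coc_max_px: float = COC_MAX_PX) -> np.ndarray:
    """Map a relative depth map (H, W) in [0, 1] to a pseudo CoC map in pixels."""
    h, w = depth.shape
    focus_depth = depth[h // 2, w // 2]  # focus is fixed at the image center
    deviation = np.abs(depth - focus_depth)
    peak = deviation.max()
    if peak == 0:
        # Flat depth map: everything sits at the focus depth.
        return np.zeros_like(depth, dtype=np.float64)
    # Assumption: the largest deviation in the image maps to coc_max_px.
    return coc_max_px * deviation / peak
```

For a horizontal depth ramp from 0 to 1, this gives a CoC of 0 in the center column and 4 px at the left and right edges.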

	### RendererNet input

	RendererNet is a U-Net with 6 input channels and 3 output channels (RGB):

	| Channel(s) | Content |
	|------------|---------|
	| 0–2 | RGB image, resized to 512×512 |
	| 3 | f-stop map, normalized by `F_STOP_MAX` (22.0) |
	| 4 | Focal length map, normalized by `FOCAL_LENGTH_MM_MAX` (200.0) |
	| 5 | CoC map, clipped to [0, 25] px and normalized by `COC_PX_NORM` (25.0) |
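
The channel layout above could be assembled roughly as follows. This is a sketch under stated assumptions, not the code in `app.py`: the function name `build_renderer_input` is hypothetical, the RGB values are assumed to be floats in [0, 1], the layout is assumed to be channels-first, and the inputs are assumed to be already resized to the network resolution.

```python
import numpy as np

# Constants as listed in this README.
F_STOP_MAX = 22.0
FOCAL_LENGTH_MM_MAX = 200.0
COC_PX_NORM = 25.0


def build_renderer_input(rgb, coc_px, f_stop, focal_length_mm):
    """Stack RGB, f-stop, focal length, and CoC into a (6, H, W) float32 array.

    rgb:    (H, W, 3) floats, assumed to lie in [0, 1]
    coc_px: (H, W) circle-of-confusion map in pixels
    """
    h, w, _ = rgb.shape
    # Scalar camera parameters are broadcast to constant maps.
    f_stop_map = np.full((h, w), f_stop / F_STOP_MAX, dtype=np.float32)
    focal_map = np.full((h, w), focal_length_mm / FOCAL_LENGTH_MM_MAX, dtype=np.float32)
    coc_map = np.clip(coc_px, 0.0, COC_PX_NORM) / COC_PX_NORM
    channels = [rgb[..., 0], rgb[..., 1], rgb[..., 2], f_stop_map, focal_map, coc_map]
    return np.stack(channels, axis=0).astype(np.float32)
```

A batch dimension would still need to be added before passing the array to a PyTorch model.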

	The network output is resized back to the working resolution and blended with the original using a smoothstep weight derived from the CoC map and the in-focus CoC threshold slider.

	### Blending

	The blend weight uses a smoothstep on CoC values above the focus threshold, so transitions between sharp and blurred regions are gradual rather than hard-edged. Where CoC is near zero, the original pixels are kept; where CoC is large, the RendererNet output dominates.
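
A smoothstep blend of this kind can be sketched as below. The names `smoothstep` and `blend` are hypothetical, and the upper edge `full_blur_coc` (the CoC at which the render fully replaces the original) is an assumption; the defaults of 0.4 and 1.0 are borrowed from the region boundaries used in the edge-artifact benchmark, not read from `app.py`.

```python
import numpy as np


def smoothstep(edge0: float, edge1: float, x: np.ndarray) -> np.ndarray:
    """Classic smoothstep: 0 below edge0, 1 above edge1, cubic ramp between."""
    t = np.clip((x - edge0) / (edge1 - edge0), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)


def blend(original, rendered, coc, focus_threshold=0.4, full_blur_coc=1.0):
    """Blend (H, W, 3) images using a per-pixel weight derived from coc (H, W)."""
    weight = smoothstep(focus_threshold, full_blur_coc, coc)[..., None]
    # weight = 0 keeps the original pixel; weight = 1 uses the render.
    return (1.0 - weight) * original + weight * rendered
```

With these defaults, a pixel with CoC 0.7 sits halfway through the ramp and receives an equal mix of the original and the render.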

	## Models

	### Depth Anything V2

	- Default checkpoint: `depth-anything/Depth-Anything-V2-Base-hf`
	- Loaded at startup via `AutoImageProcessor` and `AutoModelForDepthEstimation` from `transformers`.
	- Returns relative (not metric) depth, normalized per image to [0, 1].
	- Requires torch and torchvision.

	Swap to a different variant via the `DEPTH_MODEL_ID` environment variable:

	| Model ID | Trade-off |
	|----------|-----------|
	| `depth-anything/Depth-Anything-V2-Small-hf` | Faster, lower quality; good for CPU Spaces |
	| `depth-anything/Depth-Anything-V2-Base-hf` | Default; balanced speed and quality |
	| `depth-anything/Depth-Anything-V2-Large-hf` | Best quality, slowest |
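
For reference, depth estimation with these checkpoints through `transformers` typically looks like the sketch below. The function names `normalize_depth` and `estimate_depth` are hypothetical, and the bicubic resize back to the input size is an assumption about the app's behavior. Unlike the app, which loads the model once at startup, this sketch loads it on each call for brevity.

```python
import os

import numpy as np

DEPTH_MODEL_ID = os.environ.get(
    "DEPTH_MODEL_ID", "depth-anything/Depth-Anything-V2-Base-hf"
)


def normalize_depth(depth: np.ndarray) -> np.ndarray:
    """Rescale a depth map to [0, 1] per image; a flat map becomes all zeros."""
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-8:
        return np.zeros_like(depth, dtype=np.float32)
    return ((depth - lo) / (hi - lo)).astype(np.float32)


def estimate_depth(image, model_id: str = DEPTH_MODEL_ID) -> np.ndarray:
    """Return relative depth in [0, 1] with shape (H, W) for a PIL image."""
    # Heavy imports are kept local so the helper above works without them.
    import torch
    from transformers import AutoImageProcessor, AutoModelForDepthEstimation

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModelForDepthEstimation.from_pretrained(model_id).eval()

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        predicted = model(**inputs).predicted_depth  # shape (1, H', W')

    # PIL reports size as (W, H); interpolate expects (H, W).
    resized = torch.nn.functional.interpolate(
        predicted.unsqueeze(1),
        size=image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )
    return normalize_depth(resized[0, 0].cpu().numpy())
```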

	### RendererNet

	- A U-Net architecture (inlined in `app.py`) trained to render defocus given RGB + camera parameters + CoC.
	- Weights are loaded from `renderer/best_renderer.pth` by default.
	- Can alternatively be fetched from the Hugging Face Hub when `RENDERER_REPO_ID` is set.

	## Using the demo

	1. Upload an image (JPEG or PNG).
	2. Adjust f-stop (0.95–22.0): lower values produce stronger background blur.
	3. Adjust focal length (4–200 mm): longer focal lengths strengthen the shallow depth-of-field effect.
	4. Tune the in-focus CoC threshold: pixels with CoC below this value keep the original sharp pixels instead of the NN render.
	5. Optionally expand the Gaussian baseline panel to configure the comparison blur (CoC threshold and sigma).
	6. Click Render.

	If a `cache/` directory with example images exists, sample images appear in the Examples panel.

	## Configuration

	All settings below can be overridden with environment variables. On a Hugging Face Space, set them under Settings → Repository secrets / Variables, so no code edits are needed.

	| Variable | Default | Description |
	|----------|---------|-------------|
	| `DEPTH_MODEL_ID` | `depth-anything/Depth-Anything-V2-Base-hf` | Hugging Face model ID for depth estimation |
	| `RENDERER_LOCAL_PATH` | `renderer/best_renderer.pth` | Path to RendererNet weights on disk |
	| `RENDERER_REPO_ID` | (empty) | Hugging Face repo to download weights from |
	| `RENDERER_FILENAME` | `best_renderer.pth` | Filename within the Hub repo |
	| `HF_TOKEN` | (empty) | Token for private Hub repos (only needed with `RENDERER_REPO_ID`) |
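
One way to read these variables and pick a weights file is sketched below. The function names `load_config` and `resolve_renderer_weights` are hypothetical, and giving the Hub repo precedence over the local path whenever `RENDERER_REPO_ID` is set is an assumption; the actual logic in `app.py` may differ.

```python
import os


def load_config(env=None) -> dict:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "DEPTH_MODEL_ID": env.get(
            "DEPTH_MODEL_ID", "depth-anything/Depth-Anything-V2-Base-hf"
        ),
        "RENDERER_LOCAL_PATH": env.get(
            "RENDERER_LOCAL_PATH", "renderer/best_renderer.pth"
        ),
        "RENDERER_REPO_ID": env.get("RENDERER_REPO_ID", ""),
        "RENDERER_FILENAME": env.get("RENDERER_FILENAME", "best_renderer.pth"),
        "HF_TOKEN": env.get("HF_TOKEN", ""),
    }


def resolve_renderer_weights(config: dict) -> str:
    """Return a filesystem path to the RendererNet weights."""
    if not config["RENDERER_REPO_ID"]:
        return config["RENDERER_LOCAL_PATH"]
    # Assumption: a configured Hub repo takes precedence over the local file.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=config["RENDERER_REPO_ID"],
        filename=config["RENDERER_FILENAME"],
        token=config["HF_TOKEN"] or None,  # None means anonymous access
    )
```

With no variables set, `resolve_renderer_weights(load_config({}))` returns `renderer/best_renderer.pth` without touching the network.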

	### Internal constants

	These are fixed in code and must match RendererNet's training setup:

	| Constant | Value | Purpose |
	|----------|-------|---------|
	| `F_STOP_MAX` | 22.0 | f-stop normalization divisor |
	| `FOCAL_LENGTH_MM_MAX` | 200.0 | Focal length normalization divisor (mm) |
	| `COC_PX_NORM` | 25.0 | CoC channel clip limit and normalization divisor (px) |
	| `TARGET_SIZE` | 512 | RendererNet spatial resolution (px) |
	| `COC_MAX_PX` | 4.0 | Maximum pseudo CoC magnitude (px) |
	| `MAX_SIDE` | 1024 | Longest image side before inference (px) |
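
As an illustration of how `MAX_SIDE` might be applied, the sketch below computes a working resolution that preserves aspect ratio. The function name `working_size` is hypothetical, and leaving smaller images at their original size (no upscaling) is an assumption.

```python
MAX_SIDE = 1024  # value from the table above


def working_size(width: int, height: int, max_side: int = MAX_SIDE) -> tuple:
    """Return (width, height) scaled so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        # Assumption: images that already fit are left untouched.
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 2048×1024 input would be processed at 1024×512, while an 800×600 input would keep its size.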
	\| `MAX_SIDE` \| 1024 \| Longest image side before inference \|

	## Local development

	### Requirements

	- Python 3.10+ (3.13 on Hugging Face Spaces)
	- CUDA optional (falls back to CPU)

	### Install and run

	```bash
	git clone https://github.com/<your-username>/Physically-Based-Portrait-Mode-Engine.git
	cd Physically-Based-Portrait-Mode-Engine

	pip install -r requirements.txt

	# Ensure RendererNet weights are present
	# (renderer/best_renderer.pth is included in the repo)

	python app.py
	```

	Gradio will print a local URL (typically `http://127.0.0.1:7860`).

	### Dependencies

	```
	gradio>=4.44.0
	torch>=2.1.0
	torchvision>=0.16.0
	transformers>=4.45.0
	huggingface_hub>=0.24.0
	numpy>=1.26.0
	pillow>=10.0.0
	scikit-image>=0.22.0
	matplotlib>=3.7.0
	```

	On first run, Depth Anything V2 weights are downloaded from the Hugging Face Hub (~370 MB for the Base model).

	### Quantitative evaluation: edge artifacts

	`quantitative-tests/benchmark_edge_artifacts.py` runs the same fixed f/1.2 pipeline as `prototype.py` on the four JPGs in `cache/` and compares edge artifacts against the original image for two compositing methods:

	1. Flat Gaussian background — uniform blur pasted in wherever pseudo-CoC > 1 px (hard mask).
	2. CoC-weighted NN render — RendererNet output blended back onto the original with a smoothstep weight (in-focus pixels stay untouched).

	Metrics are computed in three regions where compositing artifacts show up most: in-focus pixels (CoC ≤ 0.4), the transition band (0.4 < CoC ≤ 1.0), and a boundary ring (±3 px around the Gaussian mask edge). Lower is better. Full per-image numbers are in `quantitative-tests/results/edge_artifact_benchmark.json`.
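
The three regions and the mean absolute difference can be sketched as follows. This is an illustration, not the benchmark script: the function names are hypothetical, and the boundary ring is approximated with a square neighborhood of the given radius, which is an assumption. The gradient-excess metric is left out because its exact definition is not given in this README.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2r+1) x (2r+1) square neighborhood."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def region_masks(coc, in_focus_max=0.4, transition_max=1.0,
                 gaussian_mask_min=1.0, ring_px=3):
    """Boolean masks for the in-focus, transition, and boundary-ring regions."""
    in_focus = coc <= in_focus_max
    transition = (coc > in_focus_max) & (coc <= transition_max)
    gaussian_mask = coc > gaussian_mask_min  # where the baseline pastes blur
    grown = dilate(gaussian_mask, ring_px)
    shrunk = ~dilate(~gaussian_mask, ring_px)  # erosion via the complement
    ring = grown & ~shrunk  # pixels within ring_px of the mask edge
    return {"in_focus": in_focus, "transition": transition, "boundary_ring": ring}


def mean_abs_diff(result, original, mask) -> float:
    """Mean absolute per-channel difference over the masked pixels."""
    if not mask.any():
        return 0.0
    return float(np.abs(result - original)[mask].mean())
```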

	Setup: f/1.2, focal length 6.765 mm, Gaussian σ = 12 px, images resized to max side 768 px, focus at image center.

	| Metric (avg. across 4 images) | Gaussian vs original | NN render vs original | NN improvement |
	|---|---:|---:|---:|
	| Boundary ring mean abs diff | 0.066 | 0.007 | 9.3× |
	| Boundary ring mean grad excess | 0.044 | 0.005 | 8.2× |
	| Global mean grad excess | 0.025 | 0.009 | 2.8× |
	| In-focus mean grad excess | 0.0008 | 0.00009 | 8.9× |

	Takeaways:

	- The hard Gaussian mask produces strong halos at the focus/defocus boundary — boundary-ring pixel error averages ~0.066 vs ~0.007 for the NN (roughly 6–21× better per image).
	- In-focus regions stay clean for both methods (mean abs diff ≈ 0), but the NN leaks less spurious edge energy even there (~9× lower in-focus grad excess on average).
	- The NN’s smoothstep blend removes most of the visible cut-line artifact; the flat Gaussian baseline is useful mainly as a simple comparison point, not as a production compositor.

	Re-run the benchmark:

	```bash
	python quantitative-tests/benchmark_edge_artifacts.py
	```

	## Project structure

	```
	.
	├── app.py                  # Gradio app, RendererNet definition, full inference pipeline
	├── requirements.txt        # Python dependencies
	├── renderer/
	│   └── best_renderer.pth   # Trained RendererNet weights
	└── cache/                  # Optional example images for Gradio Examples
	```

	## Limitations

	- Fixed focus point: Focus is always the image center; there is no click-to-focus or subject detection.
	- Relative depth only: Depth Anything V2 outputs relative depth, not metric distances, so the CoC map is a heuristic rather than a physically exact optical simulation.
	- Resolution cap: Images are downscaled so the longest side is at most 1024 px to keep CPU inference responsive.
	- RendererNet runs at 512×512: Fine detail may be softened; output is upsampled to the working resolution.
	- CPU Spaces are slow: First inference after startup can take tens of seconds; consider the Small depth model or a GPU Space for faster turnaround.

	## Acknowledgments

	- [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2) — monocular depth estimation ([Lihe Yang et al.](https://huggingface.co/depth-anything/Depth-Anything-V2-Base-hf))
	- [Hugging Face Transformers](https://github.com/huggingface/transformers) — model loading and inference
	- [Gradio](https://gradio.app/) — web UI