title: Physically Based Portrait Mode Engine
emoji: π
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.19.0
python_version: '3.13'
app_file: app.py
pinned: false
short_description: An improved portrait mode renderer for pictures.
Physically-Based Portrait Mode Engine
A Gradio demo that simulates shallow depth-of-field (portrait mode) on ordinary photos. Instead of applying a flat Gaussian blur to the background, it estimates scene depth, builds a circle-of-confusion (CoC) map, and uses a trained neural renderer (RendererNet) to produce physically motivated defocus. In-focus regions are preserved by blending the render back onto the original image.
The app is designed to run as a Hugging Face Space and is self-contained: model definitions and inference code live in a single app.py file.
What it does
Given an input photograph and camera parameters (f-stop, focal length), the pipeline:
- Estimates relative depth with Depth Anything V2 via the Hugging Face transformers library.
- Builds a pseudo CoC map by measuring how far each pixel's depth deviates from the depth at the image center (the assumed focus point). Pixels closer to the focus depth get a CoC near zero; pixels farther away get larger CoC values.
- Renders defocus with RendererNet, a U-Net that takes the RGB image plus normalized f-stop, focal length, and CoC channels and outputs a blurred RGB image at the chosen aperture settings.
- Blends the NN render onto the original using a smooth CoC-based weight so that in-focus areas (CoC below a threshold) remain untouched.
- Produces a non-NN baseline for comparison: a flat Gaussian blur applied only where CoC exceeds a separate threshold.
Three outputs are shown:
| Output | Description |
|---|---|
| Rendered (NN) | RendererNet defocus, blended with the original in in-focus regions |
| Gaussian baseline | Simple background blur for comparison against the learned renderer |
| Pseudo CoC map | Colorized circle-of-confusion visualization (inferno colormap) |
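The Gaussian baseline from the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in app.py: the function names are hypothetical, the separable blur is a stand-in for whatever blur routine the app uses, and the default threshold (1 px) and sigma (12 px) are borrowed from the benchmark setup described later in this README rather than from the app's sliders.

```python
import numpy as np

def gaussian_kernel(sigma: float) -> np.ndarray:
    """1-D Gaussian kernel truncated at 3 sigma, normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur_separable(img: np.ndarray, sigma: float) -> np.ndarray:
    """Blur an (H, W, C) float image along rows, then columns (edge-padded)."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    out = img.astype(np.float64)
    for axis in (0, 1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (r, r)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded
        )
    return out

def gaussian_baseline(img, coc, coc_threshold=1.0, sigma=12.0):
    """Paste the flat blur in wherever CoC exceeds the threshold (hard mask)."""
    blurred = blur_separable(img, sigma)
    mask = (coc > coc_threshold)[..., None]  # broadcast over the channel axis
    return np.where(mask, blurred, img)
```

Because the mask is binary, the sharp and blurred regions meet at a hard cut line; that is the artifact the learned renderer's smooth blend is meant to avoid.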
Pipeline overview
Input image
    │
    ▼
Resize (longest side ≤ 1024 px)
    │
    ▼
Depth Anything V2 ──► relative depth [0, 1]
    │
    ▼
Pseudo CoC map (focus = image center, max 4 px)
    │
    ├──────────────────────────────┐
    ▼                              ▼
RendererNet (512×512)         Gaussian blur baseline
f-stop + focal length + CoC   (CoC > threshold)
    │                              │
    ▼                              │
CoC-weighted blend ────────────────┘
    │
    ▼
Final outputs (NN render, baseline, CoC map)
Pseudo CoC
Because the demo has no interactive focus point, focus is fixed at the center of the image (h // 2, w // 2). The CoC at each pixel is proportional to the absolute difference between that pixel's relative depth and the depth at the focus point, scaled to a maximum of 4 px (COC_MAX_PX).
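A minimal sketch of this computation follows. It assumes the depth map is a 2-D NumPy array already normalized to [0, 1], and it assumes that "scaled to a maximum of 4 px" means dividing by the largest deviation in the image so the peak equals COC_MAX_PX; app.py may scale differently. The function name is hypothetical.

```python
import numpy as np

COC_MAX_PX = 4.0  # value listed under "Internal constants"

def pseudo_coc(depth: np.ndarray, coc_max_px: float = COC_MAX_PX) -> np.ndarray:
    """Pseudo CoC in pixels from a relative depth map of shape (H, W)."""
    h, w = depth.shape
    focus_depth = depth[h // 2, w // 2]      # focus is fixed at the image center
    deviation = np.abs(depth - focus_depth)  # distance from the focal plane
    peak = deviation.max()
    if peak == 0:                            # flat depth map: everything in focus
        return np.zeros_like(depth, dtype=np.float64)
    # Assumption: rescale so the largest deviation maps to coc_max_px.
    return deviation / peak * coc_max_px
```

Since the depth is relative rather than metric, the result is a heuristic ranking of how out of focus each pixel is, not a measured blur diameter.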
RendererNet input
RendererNet is a U-Net with 6 input channels and 3 output channels (RGB):
| Channel(s) | Content |
|---|---|
| 0–2 | RGB image, resized to 512×512 |
| 3 | f-stop map, normalized by F_STOP_MAX (22.0) |
| 4 | Focal length map, normalized by FOCAL_LENGTH_MM_MAX (200.0) |
| 5 | CoC map, clipped to [0, 25] px and normalized by COC_PX_NORM (25.0) |
The network output is resized back to the working resolution and blended with the original using a smoothstep weight derived from the CoC map and the in-focus CoC threshold slider.
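The six-channel layout in the table above could be assembled roughly as follows. This sketch uses NumPy instead of torch, assumes RGB values in [0, 1] and a channel-first (6, H, W) layout, and leaves out the resize to 512×512; the function name is hypothetical and the real preprocessing in app.py may differ in these details.

```python
import numpy as np

# Normalization constants listed under "Internal constants".
F_STOP_MAX = 22.0
FOCAL_LENGTH_MM_MAX = 200.0
COC_PX_NORM = 25.0

def build_renderer_input(rgb, coc_px, f_stop, focal_length_mm):
    """Stack RGB, f-stop, focal length, and CoC into a (6, H, W) float32 array.

    rgb:    (H, W, 3) floats, assumed to be in [0, 1]
    coc_px: (H, W) CoC in pixels
    """
    h, w = coc_px.shape
    f_stop_map = np.full((h, w), f_stop / F_STOP_MAX)
    focal_map = np.full((h, w), focal_length_mm / FOCAL_LENGTH_MM_MAX)
    coc_map = np.clip(coc_px, 0.0, COC_PX_NORM) / COC_PX_NORM
    planes = [rgb[..., 0], rgb[..., 1], rgb[..., 2], f_stop_map, focal_map, coc_map]
    return np.stack(planes, axis=0).astype(np.float32)
```

Encoding the scalar camera parameters as constant-valued maps lets a fully convolutional network see them at every spatial location.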
Blending
The blend weight uses a smoothstep on CoC values above the focus threshold, so transitions between sharp and blurred regions are gradual rather than hard-edged. Where CoC is near zero, the original pixels are kept; where CoC is large, the RendererNet output dominates.
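The blend can be sketched with the standard smoothstep polynomial. The lower edge corresponds to the in-focus CoC threshold; the upper edge (the CoC at which the render fully takes over) is an assumption here, set to 1.0 px to match the transition band used in the benchmark section. Function and parameter names are hypothetical.

```python
import numpy as np

def smoothstep(edge0, edge1, x):
    """Standard smoothstep: 0 below edge0, 1 above edge1, cubic ramp between."""
    t = np.clip((x - edge0) / (edge1 - edge0), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)

def blend(original, rendered, coc, focus_threshold=0.4, full_blur_coc=1.0):
    """Mix (H, W, 3) images using a per-pixel weight derived from the (H, W) CoC."""
    weight = smoothstep(focus_threshold, full_blur_coc, coc)[..., None]
    # weight == 0 keeps the original pixel; weight == 1 uses the NN render.
    return (1.0 - weight) * original + weight * rendered
```

Smoothstep has zero slope at both edges, so the weight eases in and out of the transition band instead of producing a visible seam.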
Models
Depth Anything V2
- Default checkpoint: depth-anything/Depth-Anything-V2-Base-hf
- Loaded at startup via AutoImageProcessor and AutoModelForDepthEstimation from transformers.
- Returns relative (not metric) depth, normalized per image to [0, 1].
- Requires torch and torchvision.
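The per-image normalization mentioned above could look like the following. This is a sketch that assumes plain min-max scaling of the model's raw prediction; the exact normalization in app.py is not spelled out in this README, and the function name is hypothetical.

```python
import numpy as np

def normalize_depth(raw, eps=1e-8):
    """Min-max normalize a raw depth prediction to [0, 1] for a single image."""
    raw = np.asarray(raw, dtype=np.float64)
    lo, hi = raw.min(), raw.max()
    if hi - lo < eps:  # constant prediction: avoid dividing by (near) zero
        return np.zeros_like(raw)
    return (raw - lo) / (hi - lo)
```

Because the scaling is recomputed for every image, the same normalized value can correspond to very different real-world distances in two different photos.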
Swap to a different variant via the DEPTH_MODEL_ID environment variable:
| Model ID | Trade-off |
|---|---|
| depth-anything/Depth-Anything-V2-Small-hf | Faster, lower quality; good for CPU Spaces |
| depth-anything/Depth-Anything-V2-Base-hf | Default; balanced speed and quality |
| depth-anything/Depth-Anything-V2-Large-hf | Best quality, slowest |
RendererNet
- A U-Net architecture (inlined in app.py) trained to render defocus given RGB + camera parameters + CoC.
- Weights are loaded from renderer/best_renderer.pth by default.
- Can alternatively be fetched from the Hugging Face Hub when RENDERER_REPO_ID is set.
Using the demo
- Upload an image (JPEG or PNG).
- Adjust f-stop (0.95–22.0): lower values produce stronger background blur.
- Adjust focal length (4–200 mm): longer focal lengths increase the shallow-DoF effect.
- Tune in-focus CoC threshold: where CoC falls below this value, the NN render is suppressed and the original pixels stay sharp.
- Optionally expand Gaussian baseline to configure the comparison blur (CoC threshold and sigma).
- Click Render.
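For intuition about why a lower f-stop and a longer focal length both strengthen the blur, the textbook thin-lens circle-of-confusion formula is sketched below. This is background optics, not what the app computes: the app uses the pseudo CoC heuristic and a learned renderer. The function name and the example distances are hypothetical.

```python
def thin_lens_coc_mm(focal_length_mm, f_stop, focus_dist_mm, subject_dist_mm):
    """Blur-circle diameter on the sensor (mm) under the thin-lens model.

    c = (f / N) * f / (s_f - f) * |s - s_f| / s
    where f is the focal length, N the f-stop, s_f the focus distance,
    and s the distance of the point being imaged.
    """
    if focus_dist_mm <= focal_length_mm:
        raise ValueError("focus distance must exceed the focal length")
    aperture_mm = focal_length_mm / f_stop  # entrance pupil diameter
    magnification = focal_length_mm / (focus_dist_mm - focal_length_mm)
    defocus = abs(subject_dist_mm - focus_dist_mm) / subject_dist_mm
    return aperture_mm * magnification * defocus

# Subject in focus at 2 m, background at 5 m (illustrative distances).
wide_open = thin_lens_coc_mm(50.0, 1.4, 2000.0, 5000.0)
stopped_down = thin_lens_coc_mm(50.0, 8.0, 2000.0, 5000.0)
```

The CoC is inversely proportional to the f-stop and grows roughly with the square of the focal length, which matches the direction of both sliders.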
If a cache/ directory with example images exists, sample images appear in the Examples panel.
Configuration
All settings below can be overridden with environment variables (useful for Hugging Face Space Settings → Repository secrets / Variables without editing code).
| Variable | Default | Description |
|---|---|---|
| DEPTH_MODEL_ID | depth-anything/Depth-Anything-V2-Base-hf | Hugging Face model ID for depth estimation |
| RENDERER_LOCAL_PATH | renderer/best_renderer.pth | Path to RendererNet weights on disk |
| RENDERER_REPO_ID | (empty) | Hugging Face repo to download weights from |
| RENDERER_FILENAME | best_renderer.pth | Filename within the Hub repo |
| HF_TOKEN | (empty) | Token for private Hub repos (only needed with RENDERER_REPO_ID) |
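One way to read these variables with their documented defaults is sketched below. The helper names are hypothetical, and the precedence rule (use the Hub whenever RENDERER_REPO_ID is non-empty, otherwise the local path) is an assumption; app.py may resolve the two sources in a different order.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "DEPTH_MODEL_ID": "depth-anything/Depth-Anything-V2-Base-hf",
    "RENDERER_LOCAL_PATH": "renderer/best_renderer.pth",
    "RENDERER_REPO_ID": "",
    "RENDERER_FILENAME": "best_renderer.pth",
    "HF_TOKEN": "",
}

def load_settings(env=None):
    """Return each setting, taking the environment value when present."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}

def weights_source(settings):
    """Return ("hub", repo_id, filename) or ("local", path).

    Assumption: a non-empty RENDERER_REPO_ID takes precedence over the local file.
    """
    if settings["RENDERER_REPO_ID"]:
        return ("hub", settings["RENDERER_REPO_ID"], settings["RENDERER_FILENAME"])
    return ("local", settings["RENDERER_LOCAL_PATH"])
```

In the Hub case, the weights file could then be fetched with huggingface_hub.hf_hub_download(repo_id=..., filename=..., token=...), passing the token only when one is set.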
Internal constants
These are fixed in code and must match RendererNet's training setup:
| Constant | Value | Purpose |
|---|---|---|
| F_STOP_MAX | 22.0 | f-stop normalization divisor |
| FOCAL_LENGTH_MM_MAX | 200.0 | Focal length normalization divisor |
| COC_PX_NORM | 25.0 | CoC channel clip limit and normalization divisor |
| TARGET_SIZE | 512 | RendererNet spatial resolution |
| COC_MAX_PX | 4.0 | Maximum pseudo CoC magnitude |
| MAX_SIDE | 1024 | Longest image side before inference |
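As an illustration of how MAX_SIDE is applied, the working resolution can be computed as below. The sketch assumes the aspect ratio is preserved and that images already within the limit are left untouched (not upscaled); the function name is hypothetical.

```python
MAX_SIDE = 1024  # value from the table above

def fit_longest_side(width, height, max_side=MAX_SIDE):
    """Return (width, height) scaled so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:  # already small enough: keep the original size
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4000×3000 photo would be processed at 1024×768, while an 800×600 photo would keep its size.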
Local development
Requirements
- Python 3.10+ (3.13 on Hugging Face Spaces)
- CUDA optional (falls back to CPU)
Install and run
git clone https://github.com/<your-username>/Physically-Based-Portrait-Mode-Engine.git
cd Physically-Based-Portrait-Mode-Engine
pip install -r requirements.txt
# Ensure RendererNet weights are present
# (renderer/best_renderer.pth is included in the repo)
python app.py
Gradio will print a local URL (typically http://127.0.0.1:7860).
Dependencies
gradio>=4.44.0
torch>=2.1.0
torchvision>=0.16.0
transformers>=4.45.0
huggingface_hub>=0.24.0
numpy>=1.26.0
pillow>=10.0.0
scikit-image>=0.22.0
matplotlib>=3.7.0
On first run, Depth Anything V2 weights are downloaded from the Hugging Face Hub (~370 MB for the Base model).
Quantitative evaluation: edge artifacts
quantitative-tests/benchmark_edge_artifacts.py runs the same fixed f/1.2 pipeline as prototype.py on the four JPGs in cache/ and compares edge artifacts against the original image for two compositing methods:
- Flat Gaussian background β uniform blur pasted in wherever pseudo-CoC > 1 px (hard mask).
- CoC-weighted NN render β RendererNet output blended back onto the original with a smoothstep weight (in-focus pixels stay untouched).
Metrics are computed in three regions where compositing artifacts show up most: in-focus pixels (CoC ≤ 0.4), the transition band (0.4 < CoC ≤ 1.0), and a boundary ring (±3 px around the Gaussian mask edge). Lower is better. Full per-image numbers are in quantitative-tests/results/edge_artifact_benchmark.json.
Setup: f/1.2, focal length 6.765 mm, Gaussian σ = 12 px, images resized to max side 768 px, focus at image center.
| Metric (avg. across 4 images) | Gaussian vs original | NN render vs original | NN improvement |
|---|---|---|---|
| Boundary ring mean abs diff | 0.066 | 0.007 | 9.3× |
| Boundary ring mean grad excess | 0.044 | 0.005 | 8.2× |
| Global mean grad excess | 0.025 | 0.009 | 2.8× |
| In-focus mean grad excess | 0.0008 | 0.00009 | 8.9× |
Takeaways:
- The hard Gaussian mask produces strong halos at the focus/defocus boundary: boundary-ring pixel error averages ~0.066 vs ~0.007 for the NN (roughly 6–21× better per image).
- In-focus regions stay clean for both methods (mean abs diff ≈ 0), but the NN leaks less spurious edge energy even there (~9× lower in-focus grad excess on average).
- The NN's smoothstep blend removes most of the visible cut-line artifact; the flat Gaussian baseline is useful mainly as a simple comparison point, not as a production compositor.
Re-run the benchmark:
python quantitative-tests/benchmark_edge_artifacts.py
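The CoC-based regions and the mean-absolute-difference metric described above can be sketched as follows. This is an illustration only, not the benchmark script: the function names are hypothetical, and the boundary ring and the gradient-excess metric are omitted because their exact definitions are not given in this README.

```python
import numpy as np

def region_masks(coc, in_focus_max=0.4, transition_max=1.0):
    """Boolean (H, W) masks for the in-focus and transition regions."""
    in_focus = coc <= in_focus_max
    transition = (coc > in_focus_max) & (coc <= transition_max)
    return {"in_focus": in_focus, "transition": transition}

def mean_abs_diff(result, original, mask):
    """Mean absolute difference between two (H, W, 3) images inside a mask."""
    if not mask.any():  # empty region: no pixels to average
        return float("nan")
    diff = np.abs(result.astype(np.float64) - original.astype(np.float64))
    return float(diff[mask].mean())
```

A value near zero in the in-focus region means the compositing step left the sharp pixels essentially unchanged.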
Project structure
.
├── app.py                  # Gradio app, RendererNet definition, full inference pipeline
├── requirements.txt        # Python dependencies
├── renderer/
│   └── best_renderer.pth   # Trained RendererNet weights
└── cache/                  # Optional example images for Gradio Examples
Limitations
- Fixed focus point: Focus is always the image center; there is no click-to-focus or subject detection.
- Relative depth only: Depth Anything V2 outputs ordinal depth, not metric distances, so the CoC map is a heuristic rather than a physically exact optical simulation.
- Resolution cap: Images are downscaled so the longest side is at most 1024 px to keep CPU inference responsive.
- RendererNet runs at 512×512: Fine detail may be softened; output is upsampled to the working resolution.
- CPU Spaces are slow: First inference after startup can take tens of seconds; consider the Small depth model or a GPU Space for faster turnaround.
Acknowledgments
- Depth Anything V2: monocular depth estimation (Lihe Yang et al.)
- Hugging Face Transformers: model loading and inference
- Gradio: web UI