Tejaswi Tripathi
metadata
title: Physically Based Portrait Mode Engine
emoji: 🐠
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.19.0
python_version: '3.13'
app_file: app.py
pinned: false
short_description: An improved portrait mode renderer for pictures.

Physically-Based Portrait Mode Engine

A Gradio demo that simulates shallow depth-of-field (portrait mode) on ordinary photos. Instead of applying a flat Gaussian blur to the background, it estimates scene depth, builds a circle-of-confusion (CoC) map, and uses a trained neural renderer (RendererNet) to produce physically motivated defocus. In-focus regions are preserved by blending the render back onto the original image.

The app is designed to run as a Hugging Face Space and is self-contained: model definitions and inference code live in a single app.py file.

What it does

Given an input photograph and camera parameters (f-stop, focal length), the pipeline:

  1. Estimates relative depth with Depth Anything V2 via the Hugging Face transformers library.
  2. Builds a pseudo CoC map by measuring how far each pixel's depth deviates from the depth at the image center (the assumed focus point). Pixels closer to the focus depth get a CoC near zero; pixels farther away get larger CoC values.
  3. Renders defocus with RendererNet, a U-Net that takes the RGB image plus normalized f-stop, focal length, and CoC channels and outputs a blurred RGB image at the chosen aperture settings.
  4. Blends the NN render onto the original using a smooth CoC-based weight so that in-focus areas (CoC below a threshold) remain untouched.
  5. Produces a non-NN baseline for comparison: a flat Gaussian blur applied only where CoC exceeds a separate threshold.

Three outputs are shown:

| Output | Description |
|---|---|
| Rendered (NN) | RendererNet defocus, blended with the original in in-focus regions |
| Gaussian baseline | Simple background blur for comparison against the learned renderer |
| Pseudo CoC map | Colorized circle-of-confusion visualization (inferno colormap) |

Pipeline overview

Input image
    │
    ▼
Resize (longest side ≤ 1024 px)
    │
    ▼
Depth Anything V2  ──►  relative depth [0, 1]
    │
    ▼
Pseudo CoC map  (focus = image center, max 4 px)
    │
    ├──────────────────────────────┐
    ▼                              ▼
RendererNet (512×512)         Gaussian blur baseline
f-stop + focal length + CoC   (CoC > threshold)
    │                              │
    ▼                              │
CoC-weighted blend ◄───────────────┘
    │
    ▼
Final outputs (NN render, baseline, CoC map)
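
The resize step at the top of the diagram can be sketched as below. This is a minimal illustration, not the code in app.py: the function name fit_longest_side is hypothetical, and it assumes that images already within the limit are left at their original size rather than upscaled.

```python
def fit_longest_side(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return (width, height) scaled so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        # Assumption: small images are passed through without upscaling.
        return width, height
    scale = max_side / longest
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_longest_side(4000, 3000))
    print(fit_longest_side(800, 600))
```

The aspect ratio is preserved up to rounding, so a 4:3 input stays 4:3 after resizing.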

Pseudo CoC

Because the demo has no interactive focus point, focus is fixed at the center of the image (h // 2, w // 2). The CoC at each pixel is proportional to the absolute difference between that pixel's relative depth and the depth at the focus point, scaled to a maximum of 4 px (COC_MAX_PX).
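
A minimal NumPy sketch of this computation follows. The function name pseudo_coc is hypothetical, and the constant of proportionality is an assumption: here the absolute depth difference (at most 1, since depth is normalized to [0, 1]) is multiplied directly by COC_MAX_PX, which bounds the result at 4 px. The actual scaling in app.py may differ.

```python
import numpy as np

COC_MAX_PX = 4.0  # maximum pseudo CoC magnitude, from the constants table


def pseudo_coc(depth: np.ndarray, coc_max_px: float = COC_MAX_PX) -> np.ndarray:
    """Map an (H, W) relative depth array in [0, 1] to a pseudo CoC in pixels."""
    h, w = depth.shape
    # Focus is fixed at the image center.
    focus_depth = depth[h // 2, w // 2]
    # Assumption: CoC = coc_max_px * |depth - focus depth|.
    coc = coc_max_px * np.abs(depth - focus_depth)
    return np.clip(coc, 0.0, coc_max_px).astype(np.float32)


if __name__ == "__main__":
    # Synthetic depth ramp: 0 on the left edge, 1 on the right edge.
    ramp = np.tile(np.linspace(0.0, 1.0, 5), (5, 1))
    print(pseudo_coc(ramp))
```

On the synthetic ramp, the center column gets a CoC of 0 and both edges get 2 px, since each is 0.5 depth units from the focus depth.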

RendererNet input

RendererNet is a U-Net with 6 input channels and 3 output channels (RGB):

| Channel(s) | Content |
|---|---|
| 0–2 | RGB image, resized to 512×512 |
| 3 | f-stop map, normalized by F_STOP_MAX (22.0) |
| 4 | Focal length map, normalized by FOCAL_LENGTH_MM_MAX (200.0) |
| 5 | CoC map, clipped to [0, 25] px and normalized by COC_PX_NORM (25.0) |
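
The channel layout above could be assembled roughly as follows. This is a sketch under stated assumptions: the function name build_renderer_input is hypothetical, the RGB image is assumed to be a float array in [0, 1] already resized to the network resolution, and a channel-first (6, H, W) layout is assumed. Only the three normalization constants come from this README.

```python
import numpy as np

F_STOP_MAX = 22.0
FOCAL_LENGTH_MM_MAX = 200.0
COC_PX_NORM = 25.0


def build_renderer_input(rgb: np.ndarray, coc_px: np.ndarray,
                         f_stop: float, focal_length_mm: float) -> np.ndarray:
    """Stack RGB, f-stop, focal length and CoC into a (6, H, W) float32 array."""
    h, w, _ = rgb.shape
    # Camera parameters become constant-valued maps, one value per pixel.
    f_map = np.full((h, w), f_stop / F_STOP_MAX, dtype=np.float32)
    fl_map = np.full((h, w), focal_length_mm / FOCAL_LENGTH_MM_MAX, dtype=np.float32)
    # CoC is clipped to [0, 25] px, then divided by the same constant.
    coc_map = np.clip(coc_px, 0.0, COC_PX_NORM) / COC_PX_NORM
    channels = [rgb[..., 0], rgb[..., 1], rgb[..., 2], f_map, fl_map, coc_map]
    return np.stack(channels, axis=0).astype(np.float32)


if __name__ == "__main__":
    x = build_renderer_input(np.zeros((512, 512, 3)), np.ones((512, 512)), 2.8, 50.0)
    print(x.shape)
```

Because the pseudo CoC is capped at 4 px, the CoC channel produced by this demo stays at or below 4 / 25 = 0.16.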

The network output is resized back to the working resolution and blended with the original using a smoothstep weight derived from the CoC map and the in-focus CoC threshold slider.

Blending

The blend weight uses a smoothstep on CoC values above the focus threshold, so transitions between sharp and blurred regions are gradual rather than hard-edged. Where CoC is near zero, the original pixels are kept; where CoC is large, the RendererNet output dominates.
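
A smoothstep blend of this kind can be sketched in NumPy as follows. The function names and the upper edge full_blur_coc are hypothetical; the default edges of 0.4 and 1.0 px are borrowed from the in-focus and transition regions used in the benchmark section, and the app's slider may map to different values.

```python
import numpy as np


def smoothstep(edge0: float, edge1: float, x: np.ndarray) -> np.ndarray:
    """Hermite smoothstep: 0 below edge0, 1 above edge1, smooth in between."""
    t = np.clip((x - edge0) / (edge1 - edge0), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)


def blend_render(original: np.ndarray, render: np.ndarray, coc_px: np.ndarray,
                 focus_threshold: float = 0.4, full_blur_coc: float = 1.0) -> np.ndarray:
    """Blend (H, W, 3) images using a per-pixel weight derived from an (H, W) CoC map."""
    # Weight is 0 where CoC <= focus_threshold, so original pixels are kept.
    weight = smoothstep(focus_threshold, full_blur_coc, coc_px)[..., None]
    return (1.0 - weight) * original + weight * render


if __name__ == "__main__":
    original = np.zeros((2, 2, 3))
    render = np.ones((2, 2, 3))
    coc = np.array([[0.0, 0.4], [0.7, 2.0]])
    print(blend_render(original, render, coc)[..., 0])
```

With these edges, a pixel at CoC 0.7 px sits halfway through the transition and receives an equal mix of the original and the render.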

Models

Depth Anything V2

  • Default checkpoint: depth-anything/Depth-Anything-V2-Base-hf
  • Loaded at startup via AutoImageProcessor and AutoModelForDepthEstimation from transformers.
  • Returns relative (not metric) depth, normalized per image to [0, 1].
  • Requires torch and torchvision.

Swap to a different variant via the DEPTH_MODEL_ID environment variable:

| Model ID | Trade-off |
|---|---|
| depth-anything/Depth-Anything-V2-Small-hf | Faster, lower quality; good for CPU Spaces |
| depth-anything/Depth-Anything-V2-Base-hf | Default; balanced speed and quality |
| depth-anything/Depth-Anything-V2-Large-hf | Best quality, slowest |

RendererNet

  • A U-Net architecture (inlined in app.py) trained to render defocus given RGB + camera parameters + CoC.
  • Weights are loaded from renderer/best_renderer.pth by default.
  • Can alternatively be fetched from the Hugging Face Hub when RENDERER_REPO_ID is set.

Using the demo

  1. Upload an image (JPEG or PNG).
  2. Adjust f-stop (0.95–22.0): lower values produce stronger background blur.
  3. Adjust focal length (4–200 mm): longer focal lengths increase the shallow-DoF effect.
  4. Tune in-focus CoC threshold: pixels with CoC below this value suppress the NN render and keep the original sharp.
  5. Optionally expand Gaussian baseline to configure the comparison blur (CoC threshold and sigma).
  6. Click Render.

If a cache/ directory with example images exists, sample images appear in the Examples panel.

Configuration

All settings below can be overridden with environment variables. On a Hugging Face Space, set them under Settings → Repository secrets / Variables so that no code changes are needed.

| Variable | Default | Description |
|---|---|---|
| DEPTH_MODEL_ID | depth-anything/Depth-Anything-V2-Base-hf | Hugging Face model ID for depth estimation |
| RENDERER_LOCAL_PATH | renderer/best_renderer.pth | Path to RendererNet weights on disk |
| RENDERER_REPO_ID | (empty) | Hugging Face repo to download weights from |
| RENDERER_FILENAME | best_renderer.pth | Filename within the Hub repo |
| HF_TOKEN | (empty) | Token for private Hub repos (only needed with RENDERER_REPO_ID) |
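
One way to resolve these variables with their defaults is sketched below. The Settings dataclass and load_settings function are hypothetical names used for illustration; only the variable names and default values come from the table above, and app.py may read them differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    depth_model_id: str
    renderer_local_path: str
    renderer_repo_id: str
    renderer_filename: str
    hf_token: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from environment variables, falling back to defaults."""
    return Settings(
        depth_model_id=env.get("DEPTH_MODEL_ID", "depth-anything/Depth-Anything-V2-Base-hf"),
        renderer_local_path=env.get("RENDERER_LOCAL_PATH", "renderer/best_renderer.pth"),
        renderer_repo_id=env.get("RENDERER_REPO_ID", ""),
        renderer_filename=env.get("RENDERER_FILENAME", "best_renderer.pth"),
        # Treat an empty token the same as an unset one.
        hf_token=env.get("HF_TOKEN") or None,
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dict, for example load_settings({"DEPTH_MODEL_ID": "depth-anything/Depth-Anything-V2-Small-hf"}).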

Internal constants

These are fixed in code and must match RendererNet's training setup:

| Constant | Value | Purpose |
|---|---|---|
| F_STOP_MAX | 22.0 | f-stop normalization divisor |
| FOCAL_LENGTH_MM_MAX | 200.0 | Focal length normalization divisor (mm) |
| COC_PX_NORM | 25.0 | CoC channel clip limit and normalization divisor (px) |
| TARGET_SIZE | 512 | RendererNet spatial resolution (px) |
| COC_MAX_PX | 4.0 | Maximum pseudo CoC magnitude (px) |
| MAX_SIDE | 1024 | Longest image side before inference (px) |

Local development

Requirements

  • Python 3.10+ (3.13 on Hugging Face Spaces)
  • CUDA optional (falls back to CPU)

Install and run

git clone https://github.com/<your-username>/Physically-Based-Portrait-Mode-Engine.git
cd Physically-Based-Portrait-Mode-Engine

pip install -r requirements.txt

# Ensure RendererNet weights are present
# (renderer/best_renderer.pth is included in the repo)

python app.py

Gradio will print a local URL (typically http://127.0.0.1:7860).

Dependencies

gradio>=4.44.0
torch>=2.1.0
torchvision>=0.16.0
transformers>=4.45.0
huggingface_hub>=0.24.0
numpy>=1.26.0
pillow>=10.0.0
scikit-image>=0.22.0
matplotlib>=3.7.0

On first run, Depth Anything V2 weights are downloaded from the Hugging Face Hub (~370 MB for the Base model).

Quantitative evaluation: edge artifacts

quantitative-tests/benchmark_edge_artifacts.py runs the same fixed f/1.2 pipeline as prototype.py on the four JPGs in cache/ and compares edge artifacts against the original image for two compositing methods:

  1. Flat Gaussian background: uniform blur pasted in wherever pseudo-CoC > 1 px (hard mask).
  2. CoC-weighted NN render: RendererNet output blended back onto the original with a smoothstep weight (in-focus pixels stay untouched).

Metrics are computed in three regions where compositing artifacts show up most: in-focus pixels (CoC ≤ 0.4), the transition band (0.4 < CoC ≤ 1.0), and a boundary ring (±3 px around the Gaussian mask edge). Lower is better. Full per-image numbers are in quantitative-tests/results/edge_artifact_benchmark.json.
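
The three regions and the mean absolute difference metric can be sketched with NumPy alone, as below. This is an illustration rather than the benchmark script: the function names are hypothetical, the ring is built with a square (Chebyshev) neighborhood, which is an assumption about how "±3 px" is measured, and the gradient-excess metric is omitted because its definition is not given here.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2r+1) x (2r+1) square window."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def region_masks(coc_px: np.ndarray, ring_radius: int = 3) -> dict:
    """Boolean masks for the in-focus, transition and boundary-ring regions."""
    gaussian_mask = coc_px > 1.0  # where the flat Gaussian blur is pasted in
    eroded = ~dilate(~gaussian_mask, ring_radius)
    return {
        "in_focus": coc_px <= 0.4,
        "transition": (coc_px > 0.4) & (coc_px <= 1.0),
        # Pixels within ring_radius of the mask edge, on either side of it.
        "boundary_ring": dilate(gaussian_mask, ring_radius) & ~eroded,
    }


def mean_abs_diff(a: np.ndarray, b: np.ndarray, region: np.ndarray) -> float:
    """Mean absolute difference between two (H, W, 3) images inside a region."""
    if not region.any():
        return float("nan")
    return float(np.abs(a - b)[region].mean())
```

A typical use would call region_masks once per image and then evaluate mean_abs_diff(original, composite, masks["boundary_ring"]) for each compositing method.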

Setup: f/1.2, focal length 6.765 mm, Gaussian σ = 12 px, images resized to max side 768 px, focus at image center.

| Metric (avg. across 4 images) | Gaussian vs original | NN render vs original | NN improvement |
|---|---|---|---|
| Boundary ring mean abs diff | 0.066 | 0.007 | 9.3× |
| Boundary ring mean grad excess | 0.044 | 0.005 | 8.2× |
| Global mean grad excess | 0.025 | 0.009 | 2.8× |
| In-focus mean grad excess | 0.0008 | 0.00009 | 8.9× |

Takeaways:

  • The hard Gaussian mask produces strong halos at the focus/defocus boundary: boundary-ring pixel error averages ~0.066 vs ~0.007 for the NN (roughly 6–21× better per image).
  • In-focus regions stay clean for both methods (mean abs diff ≈ 0), but the NN leaks less spurious edge energy even there (~9× lower in-focus grad excess on average).
  • The NN's smoothstep blend removes most of the visible cut-line artifact; the flat Gaussian baseline is useful mainly as a simple comparison point, not as a production compositor.

Re-run the benchmark:

python quantitative-tests/benchmark_edge_artifacts.py

Project structure

.
├── app.py                  # Gradio app, RendererNet definition, full inference pipeline
├── requirements.txt        # Python dependencies
├── renderer/
│   └── best_renderer.pth   # Trained RendererNet weights
└── cache/                  # Optional example images for Gradio Examples

Limitations

  • Fixed focus point: Focus is always the image center; there is no click-to-focus or subject detection.
  • Relative depth only: Depth Anything V2 outputs relative depth, not metric distances, so the CoC map is a heuristic rather than a physically exact optical simulation.
  • Resolution cap: Images are downscaled so the longest side is at most 1024 px to keep CPU inference responsive.
  • RendererNet runs at 512×512: Fine detail may be softened; output is upsampled to the working resolution.
  • CPU Spaces are slow: First inference after startup can take tens of seconds; consider the Small depth model or a GPU Space for faster turnaround.

Acknowledgments