Tejaswi Tripathi

metadata

title: Physically Based Portrait Mode Engine
emoji: 🐠
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.19.0
python_version: '3.13'
app_file: app.py
pinned: false
short_description: An improved portrait mode renderer for pictures.

Physically-Based Portrait Mode Engine

A Gradio demo that simulates shallow depth-of-field (portrait mode) on ordinary photos. Instead of applying a flat Gaussian blur to the background, it estimates scene depth, builds a circle-of-confusion (CoC) map, and uses a trained neural renderer (RendererNet) to produce physically motivated defocus. In-focus regions are preserved by blending the render back onto the original image.

The app is designed to run as a Hugging Face Space and is self-contained: model definitions and inference code live in a single app.py file.

What it does

Given an input photograph and camera parameters (f-stop, focal length), the pipeline:

Estimates relative depth with Depth Anything V2 via the Hugging Face transformers library.
Builds a pseudo CoC map by measuring how far each pixel's depth deviates from the depth at the image center (the assumed focus point). Pixels closer to the focus depth get a CoC near zero; pixels farther away get larger CoC values.
Renders defocus with RendererNet, a U-Net that takes the RGB image plus normalized f-stop, focal length, and CoC channels and outputs a blurred RGB image at the chosen aperture settings.
Blends the NN render onto the original using a smooth CoC-based weight so that in-focus areas (CoC below a threshold) remain untouched.
Produces a non-NN baseline for comparison: a flat Gaussian blur applied only where CoC exceeds a separate threshold.

Three outputs are shown:

Output	Description
Rendered (NN)	RendererNet defocus, blended with the original in in-focus regions
Gaussian baseline	Simple background blur for comparison against the learned renderer
Pseudo CoC map	Colorized circle-of-confusion visualization (inferno colormap)

Pipeline overview

Input image
    │
    ▼
Resize (longest side ≤ 1024 px)
    │
    ▼
Depth Anything V2  ──►  relative depth [0, 1]
    │
    ▼
Pseudo CoC map  (focus = image center, max 4 px)
    │
    ├──────────────────────────────┐
    ▼                              ▼
RendererNet (512×512)         Gaussian blur baseline
f-stop + focal length + CoC   (CoC > threshold)
    │                              │
    ▼                              │
CoC-weighted blend                 │
(NN render over original)          │
    │                              │
    ▼                              ▼
Final outputs (NN render, baseline, CoC map)

Pseudo CoC

Because the demo has no interactive focus point, focus is fixed at the center of the image (h // 2, w // 2). The CoC at each pixel is proportional to the absolute difference between that pixel's relative depth and the depth at the focus point, scaled to a maximum of 4 px (COC_MAX_PX).
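The pseudo CoC described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in app.py: the function name `pseudo_coc` is hypothetical, and rescaling the deviation so that its largest value equals `COC_MAX_PX` is an assumption about what "scaled to a maximum of 4 px" means.

```python
import numpy as np

COC_MAX_PX = 4.0  # value taken from the constants listed in this README


def pseudo_coc(depth: np.ndarray, coc_max_px: float = COC_MAX_PX) -> np.ndarray:
    """Map a relative depth map in [0, 1] to a pseudo CoC map in pixels.

    Assumption: the depth deviation is rescaled so its peak equals coc_max_px.
    """
    h, w = depth.shape
    focus_depth = depth[h // 2, w // 2]      # focus is fixed at the image center
    deviation = np.abs(depth - focus_depth)  # distance from the focus depth
    peak = deviation.max()
    if peak == 0:                            # flat depth map: everything is in focus
        return np.zeros(depth.shape, dtype=np.float32)
    return (deviation / peak * coc_max_px).astype(np.float32)
```

With this scaling the center pixel always gets a CoC of 0 and the pixel whose depth differs most from the center gets `COC_MAX_PX`.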

RendererNet input

RendererNet is a U-Net with 6 input channels and 3 output channels (RGB):

Channel(s)	Content
0–2	RGB image, resized to 512×512
3	f-stop map, normalized by `F_STOP_MAX` (22.0)
4	Focal length map, normalized by `FOCAL_LENGTH_MM_MAX` (200.0)
5	CoC map, clipped to [0, 25] px and normalized by `COC_PX_NORM` (25.0)

The network output is resized back to the working resolution and blended with the original using a smoothstep weight derived from the CoC map and the in-focus CoC threshold slider.
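As a rough sketch of the channel layout in the table above, the six-channel input could be assembled as follows. The function name `build_renderer_input`, the channel-first layout, and RGB values scaled to [0, 1] are assumptions; resizing to 512×512 is left out, so both arrays are expected at that size already.

```python
import numpy as np

# Constants as listed in this README
F_STOP_MAX = 22.0
FOCAL_LENGTH_MM_MAX = 200.0
COC_PX_NORM = 25.0
TARGET_SIZE = 512


def build_renderer_input(rgb, coc_px, f_stop, focal_length_mm):
    """Stack RGB + f-stop + focal length + CoC into a (6, 512, 512) float32 array.

    rgb:    (512, 512, 3) array, assumed to be scaled to [0, 1]
    coc_px: (512, 512) CoC map in pixels
    """
    plane = (TARGET_SIZE, TARGET_SIZE)
    if rgb.shape != plane + (3,) or coc_px.shape != plane:
        raise ValueError("inputs must already be resized to 512x512")
    # Scalar camera parameters are broadcast to constant-valued planes
    f_stop_map = np.full(plane, f_stop / F_STOP_MAX, dtype=np.float32)
    focal_map = np.full(plane, focal_length_mm / FOCAL_LENGTH_MM_MAX, dtype=np.float32)
    # CoC is clipped to [0, 25] px, then normalized to [0, 1]
    coc_map = np.clip(coc_px, 0.0, COC_PX_NORM) / COC_PX_NORM
    channels = [rgb[..., 0], rgb[..., 1], rgb[..., 2], f_stop_map, focal_map, coc_map]
    return np.stack(channels, axis=0).astype(np.float32)
```

Because the pseudo CoC never exceeds 4 px, the CoC channel stays at or below 4 / 25 = 0.16 in this demo.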

Blending

The blend weight uses a smoothstep on CoC values above the focus threshold, so transitions between sharp and blurred regions are gradual rather than hard-edged. Where CoC is near zero, the original pixels are kept; where CoC is large, the RendererNet output dominates.
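The blend can be illustrated with a standard smoothstep. This is a sketch under assumptions: the names `smoothstep_weight`, `blend`, and `full_blur_coc` are hypothetical, and the upper edge of the ramp (1.0 px here, borrowed from the transition band used in the benchmark section) is not confirmed by the source.

```python
import numpy as np


def smoothstep_weight(coc_px, focus_threshold, full_blur_coc):
    """Return 0 at or below focus_threshold, 1 at or above full_blur_coc."""
    t = np.clip((coc_px - focus_threshold) / (full_blur_coc - focus_threshold), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)  # classic smoothstep: zero slope at both ends


def blend(original, render, coc_px, focus_threshold=0.4, full_blur_coc=1.0):
    """Keep original pixels where CoC is small, use the render where it is large.

    original, render: (H, W, 3) float arrays; coc_px: (H, W) CoC in pixels.
    The default thresholds are assumed values for illustration only.
    """
    w = smoothstep_weight(coc_px, focus_threshold, full_blur_coc)[..., None]
    return (1.0 - w) * original + w * render
```

The zero slope at both ends of the ramp is what avoids a visible seam where the weight starts to rise or saturates.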

Models

Depth Anything V2

Default checkpoint: depth-anything/Depth-Anything-V2-Base-hf
Loaded at startup via AutoImageProcessor and AutoModelForDepthEstimation from transformers.
Returns relative (not metric) depth, normalized per image to [0, 1].
Requires torch and torchvision.

Swap to a different variant via the DEPTH_MODEL_ID environment variable:

Model ID	Trade-off
`depth-anything/Depth-Anything-V2-Small-hf`	Faster, lower quality — good for CPU Spaces
`depth-anything/Depth-Anything-V2-Base-hf`	Default; balanced speed and quality
`depth-anything/Depth-Anything-V2-Large-hf`	Best quality, slowest
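For orientation, depth estimation with these checkpoints typically looks like the sketch below, which follows the usual transformers depth-estimation pattern. The function names are hypothetical, and the model is loaded inside the function only for brevity (the app loads it once at startup). The min-max normalization is an assumption about how "normalized per image to [0, 1]" is done.

```python
import numpy as np

DEFAULT_DEPTH_MODEL_ID = "depth-anything/Depth-Anything-V2-Base-hf"


def normalize_depth(raw):
    """Min-max normalize a raw depth prediction to [0, 1] (assumed scheme)."""
    raw = np.asarray(raw, dtype=np.float32)
    lo, hi = float(raw.min()), float(raw.max())
    if hi - lo < 1e-8:  # constant prediction: avoid division by zero
        return np.zeros_like(raw)
    return (raw - lo) / (hi - lo)


def estimate_depth(image, model_id=DEFAULT_DEPTH_MODEL_ID):
    """image: PIL.Image in RGB. Returns an (H, W) float32 array in [0, 1]."""
    # Imported lazily so the helper above can be used without torch installed
    import torch
    from transformers import AutoImageProcessor, AutoModelForDepthEstimation

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModelForDepthEstimation.from_pretrained(model_id).eval()
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        predicted = model(**inputs).predicted_depth  # shape (1, h', w')
    # Resize the prediction back to the input size; PIL reports (W, H)
    resized = torch.nn.functional.interpolate(
        predicted.unsqueeze(1), size=image.size[::-1], mode="bicubic", align_corners=False
    )
    return normalize_depth(resized[0, 0].cpu().numpy())
```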

RendererNet

A U-Net architecture (inlined in app.py) trained to render defocus given RGB + camera parameters + CoC.
Weights are loaded from renderer/best_renderer.pth by default.
Can alternatively be fetched from the Hugging Face Hub when RENDERER_REPO_ID is set.

Using the demo

Upload an image (JPEG or PNG).
Adjust f-stop (0.95–22.0) — lower values produce stronger background blur.
Adjust focal length (4–200 mm) — longer focal lengths increase the shallow-DoF effect.
Tune in-focus CoC threshold (pixels with CoC below this value keep the original, sharp image instead of the NN render).
Optionally expand Gaussian baseline to configure the comparison blur (CoC threshold and sigma).
Click Render.

If a cache/ directory with example images exists, sample images appear in the Examples panel.

Configuration

All settings below can be overridden with environment variables. On a Hugging Face Space these can be set under Settings (Repository secrets / Variables), so no code edits are needed.

Variable	Default	Description
`DEPTH_MODEL_ID`	`depth-anything/Depth-Anything-V2-Base-hf`	Hugging Face model ID for depth estimation
`RENDERER_LOCAL_PATH`	`renderer/best_renderer.pth`	Path to RendererNet weights on disk
`RENDERER_REPO_ID`	(empty)	Hugging Face repo to download weights from
`RENDERER_FILENAME`	`best_renderer.pth`	Filename within the Hub repo
`HF_TOKEN`	(empty)	Token for private Hub repos (only needed with `RENDERER_REPO_ID`)
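One way to read these variables is sketched below. The function names `resolve_config` and `renderer_weights_path` are hypothetical, and the precedence (use the Hub whenever `RENDERER_REPO_ID` is set, otherwise the local path) is an assumption; app.py may order the checks differently.

```python
import os


def resolve_config(env=None):
    """Collect settings from environment variables, using the README defaults."""
    env = os.environ if env is None else env
    return {
        "depth_model_id": env.get("DEPTH_MODEL_ID", "depth-anything/Depth-Anything-V2-Base-hf"),
        "renderer_local_path": env.get("RENDERER_LOCAL_PATH", "renderer/best_renderer.pth"),
        "renderer_repo_id": env.get("RENDERER_REPO_ID", ""),
        "renderer_filename": env.get("RENDERER_FILENAME", "best_renderer.pth"),
        "hf_token": env.get("HF_TOKEN", "") or None,  # empty string means "no token"
    }


def renderer_weights_path(cfg):
    """Return a path to the RendererNet weights (assumed precedence: Hub, then local)."""
    if cfg["renderer_repo_id"]:
        # Imported lazily; only needed when weights come from the Hub
        from huggingface_hub import hf_hub_download

        return hf_hub_download(
            repo_id=cfg["renderer_repo_id"],
            filename=cfg["renderer_filename"],
            token=cfg["hf_token"],
        )
    return cfg["renderer_local_path"]
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check without touching the real environment.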

Internal constants

These are fixed in code and must match RendererNet's training setup:

Constant	Value	Purpose
`F_STOP_MAX`	22.0	f-stop normalization divisor
`FOCAL_LENGTH_MM_MAX`	200.0	Focal length normalization divisor
`COC_PX_NORM`	25.0	CoC channel clip limit and normalization divisor (px)
`TARGET_SIZE`	512	RendererNet spatial resolution
`COC_MAX_PX`	4.0	Maximum pseudo CoC magnitude
`MAX_SIDE`	1024	Longest image side before inference
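As an example of how `MAX_SIDE` is applied, the working resolution could be computed like this. The helper name `fit_longest_side` and the rounding rule are assumptions; the only behavior taken from this README is that larger images are downscaled so the longest side is at most 1024 px.

```python
MAX_SIDE = 1024  # value from the table above


def fit_longest_side(width, height, max_side=MAX_SIDE):
    """Return (width, height) scaled down so the longest side is at most max_side.

    Images that already fit are returned unchanged (no upscaling).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Rounding to the nearest pixel is an assumed choice
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For instance, a 4000×3000 photo would be processed at 1024×768, while an 800×600 photo keeps its size.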

Local development

Requirements

Python 3.10+ (3.13 on Hugging Face Spaces)
CUDA optional (falls back to CPU)

Install and run

git clone https://github.com/<your-username>/Physically-Based-Portrait-Mode-Engine.git
cd Physically-Based-Portrait-Mode-Engine

pip install -r requirements.txt

# Ensure RendererNet weights are present
# (renderer/best_renderer.pth is included in the repo)

python app.py

Gradio will print a local URL (typically http://127.0.0.1:7860).

Dependencies

gradio>=4.44.0
torch>=2.1.0
torchvision>=0.16.0
transformers>=4.45.0
huggingface_hub>=0.24.0
numpy>=1.26.0
pillow>=10.0.0
scikit-image>=0.22.0
matplotlib>=3.7.0

On first run, Depth Anything V2 weights are downloaded from the Hugging Face Hub (~370 MB for the Base model).

Quantitative evaluation: edge artifacts

quantitative-tests/benchmark_edge_artifacts.py runs the same fixed f/1.2 pipeline as prototype.py on the four JPGs in cache/ and compares edge artifacts against the original image for two compositing methods:

Flat Gaussian background — uniform blur pasted in wherever pseudo-CoC > 1 px (hard mask).
CoC-weighted NN render — RendererNet output blended back onto the original with a smoothstep weight (in-focus pixels stay untouched).
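The hard-mask baseline from the first item can be sketched as a simple per-pixel switch. The function name `hard_mask_composite` is hypothetical, and the pre-blurred image is taken as an input (it could come from any Gaussian filter), so the sketch shows only the compositing step.

```python
import numpy as np


def hard_mask_composite(original, blurred, coc_px, coc_threshold=1.0):
    """Paste the blurred image wherever CoC exceeds the threshold.

    original, blurred: (H, W, 3) arrays; coc_px: (H, W) pseudo CoC in pixels.
    Returns the composite and the boolean mask that was used.
    """
    mask = coc_px > coc_threshold  # binary decision, no soft transition
    composite = np.where(mask[..., None], blurred, original)
    return composite, mask
```

Because the mask is binary, neighboring pixels on either side of the threshold switch abruptly between sharp and blurred content, which is the cut-line artifact the benchmark measures.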

Metrics are computed in three regions where compositing artifacts show up most: in-focus pixels (CoC ≤ 0.4 px), the transition band (0.4 px < CoC ≤ 1.0 px), and a boundary ring (±3 px around the Gaussian mask edge). Gradient excess is also reported over the whole image. Lower is better. Full per-image numbers are in quantitative-tests/results/edge_artifact_benchmark.json.

Setup: f/1.2, focal length 6.765 mm, Gaussian σ = 12 px, images resized to max side 768 px, focus at image center.
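To make the boundary-ring region concrete, here is a NumPy-only sketch of how such a ring and the mean absolute difference inside it could be computed. It is not the benchmark script: the function names are hypothetical, the square structuring element is an assumption, and the gradient-excess metric is omitted because its definition is not given here.

```python
import numpy as np


def dilate(mask, radius):
    """Binary dilation with a (2r+1) x (2r+1) square, using only NumPy."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]  # OR together all shifted copies
    return out


def boundary_ring(mask, radius=3):
    """Pixels within `radius` px of the mask edge, on either side of it."""
    mask = np.asarray(mask, dtype=bool)
    grown = dilate(mask, radius)
    shrunk = ~dilate(~mask, radius)  # erosion as the complement of a dilated complement
    return grown & ~shrunk


def region_mean_abs_diff(result, original, region):
    """Mean absolute difference between two images, restricted to a boolean region."""
    if not region.any():
        return 0.0
    diff = np.abs(result.astype(np.float64) - original.astype(np.float64))
    if diff.ndim == 3:  # average over color channels first
        diff = diff.mean(axis=-1)
    return float(diff[region].mean())
```

Here `mask` would be the hard Gaussian mask (pseudo-CoC > 1 px), and the same ring is applied to both compositing methods so their scores are comparable.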

Metric (avg. across 4 images)	Gaussian vs original	NN render vs original	NN improvement
Boundary ring mean abs diff	0.066	0.007	9.3×
Boundary ring mean grad excess	0.044	0.005	8.2×
Global mean grad excess	0.025	0.009	2.8×
In-focus mean grad excess	0.0008	0.00009	8.9×

Takeaways:

The hard Gaussian mask produces strong halos at the focus/defocus boundary — boundary-ring pixel error averages ~0.066 vs ~0.007 for the NN (roughly 6–21× better per image).
In-focus regions stay clean for both methods (mean abs diff ≈ 0), but the NN leaks less spurious edge energy even there (~9× lower in-focus grad excess on average).
The NN’s smoothstep blend removes most of the visible cut-line artifact; the flat Gaussian baseline is useful mainly as a simple comparison point, not as a production compositor.

Re-run the benchmark:

python quantitative-tests/benchmark_edge_artifacts.py

Project structure

.
├── app.py                  # Gradio app, RendererNet definition, full inference pipeline
├── requirements.txt        # Python dependencies
├── renderer/
│   └── best_renderer.pth   # Trained RendererNet weights
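├── quantitative-tests/
│   ├── benchmark_edge_artifacts.py        # Edge-artifact benchmark
│   └── results/
│       └── edge_artifact_benchmark.json   # Per-image benchmark numbers
└── cache/                  # Optional example images for Gradio Examples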

Limitations

Fixed focus point: Focus is always the image center; there is no click-to-focus or subject detection.
Relative depth only: Depth Anything V2 outputs relative depth (known only up to an unknown scale and shift), not metric distances, so the CoC map is a heuristic rather than a physically exact optical simulation.
Resolution cap: Images are downscaled so the longest side is at most 1024 px to keep CPU inference responsive.
RendererNet runs at 512×512: Fine detail may be softened; output is upsampled to the working resolution.
CPU Spaces are slow: First inference after startup can take tens of seconds; consider the Small depth model or a GPU Space for faster turnaround.

Acknowledgments

Depth Anything V2 — monocular depth estimation (Lihe Yang et al.)
Hugging Face Transformers — model loading and inference
Gradio — web UI