---
title: Physically Based Portrait Mode Engine
emoji: 🐠
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.19.0
python_version: '3.13'
app_file: app.py
pinned: false
short_description: An improved portrait mode renderer for pictures.
---

# Physically-Based Portrait Mode Engine

A Gradio demo that simulates **shallow depth-of-field** (portrait mode) on ordinary photos. Instead of applying a flat Gaussian blur to the background, it estimates scene depth, builds a **circle-of-confusion (CoC)** map, and uses a trained neural renderer (**RendererNet**) to produce physically motivated defocus. In-focus regions are preserved by blending the render back onto the original image.

The app is designed to run as a [Hugging Face Space](https://huggingface.co/docs/hub/spaces) and is self-contained: model definitions and inference code live in a single `app.py` file.

## What it does

Given an input photograph and camera parameters (f-stop, focal length), the pipeline:

1. **Estimates relative depth** with [Depth Anything V2](https://huggingface.co/depth-anything/Depth-Anything-V2-Base-hf) via the Hugging Face `transformers` library.
2. **Builds a pseudo CoC map** by measuring how far each pixel's depth deviates from the depth at the **image center** (the assumed focus point). Pixels closer to the focus depth get a CoC near zero; pixels farther away get larger CoC values.
3. **Renders defocus with RendererNet**, a U-Net that takes the RGB image plus normalized f-stop, focal length, and CoC channels and outputs a blurred RGB image at the chosen aperture settings.
4. **Blends the NN render onto the original** using a smooth CoC-based weight so that in-focus areas (CoC below a threshold) remain untouched.
5. **Produces a non-NN baseline** for comparison: a flat Gaussian blur applied only where CoC exceeds a separate threshold.

Three outputs are shown:

| Output | Description |
|--------|-------------|
| **Rendered (NN)** | RendererNet defocus, blended with the original in in-focus regions |
| **Gaussian baseline** | Simple background blur for comparison against the learned renderer |
| **Pseudo CoC map** | Colorized circle-of-confusion visualization (inferno colormap) |
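
The Gaussian baseline in the table above amounts to a hard-masked composite. The following is a minimal NumPy-only sketch of that idea, not the code in `app.py`: the function names are hypothetical, and the defaults (threshold 1 px, σ = 12 px) are borrowed from the benchmark setup described later in this README, so the app's slider defaults may differ.

```python
import numpy as np


def gaussian_kernel_1d(sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel truncated at roughly 3 sigma."""
    radius = max(1, int(round(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of an (H, W, C) float image with edge padding."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    out = image.astype(np.float64)
    for axis in (0, 1):  # blur rows, then columns
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out


def gaussian_baseline(
    image: np.ndarray,
    coc: np.ndarray,
    coc_threshold: float = 1.0,  # assumed default, taken from the benchmark
    sigma: float = 12.0,         # assumed default, taken from the benchmark
) -> np.ndarray:
    """Paste a uniformly blurred copy wherever CoC exceeds the threshold."""
    blurred = gaussian_blur(image, sigma)
    mask = (coc > coc_threshold)[..., None]  # hard mask, broadcast over RGB
    return np.where(mask, blurred, image)
```

Because the mask is binary, the composite switches abruptly from sharp to blurred pixels, which is the cut-line artifact the learned renderer is compared against.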

## Pipeline overview

```
Input image
    │
    ▼
Resize (longest side ≤ 1024 px)
    │
    ▼
Depth Anything V2  ──►  relative depth [0, 1]
    │
    ▼
Pseudo CoC map  (focus = image center, max 4 px)
    │
    ├──────────────────────────────┐
    ▼                              ▼
RendererNet (512×512)         Gaussian blur baseline
f-stop + focal length + CoC   (CoC > threshold)
    │                              │
    ▼                              │
CoC-weighted blend                 │
    │                              │
    ▼                              ▼
Final outputs (NN render, baseline, CoC map)
```
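
The resize step at the top of the diagram only ever shrinks an image. A small sketch of the size computation, assuming the aspect ratio is preserved and smaller images pass through unchanged (`fit_longest_side` is a hypothetical helper name, not necessarily what `app.py` uses):

```python
def fit_longest_side(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return a (width, height) whose longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    scale = max_side / longest
    # Round to the nearest pixel and keep each side at least 1 px.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4000×3000 photo maps to 1024×768, while an 800×600 photo is left as it is.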

### Pseudo CoC

Because the demo has no interactive focus point, focus is fixed at the **center of the image** (`h // 2`, `w // 2`). The CoC at each pixel is proportional to the absolute difference between that pixel's relative depth and the depth at the focus point, scaled to a maximum of **4 px** (`COC_MAX_PX`).
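
A minimal sketch of that computation is shown below. It assumes the depth map is a 2-D array already normalized to [0, 1], and it rescales by the largest deviation so the map peaks at exactly `COC_MAX_PX`; the app may instead apply a fixed scale factor, and this sketch does not model how f-stop or focal length enter the map.

```python
import numpy as np

COC_MAX_PX = 4.0  # maximum pseudo CoC magnitude, as listed in this README


def pseudo_coc(depth: np.ndarray, coc_max_px: float = COC_MAX_PX) -> np.ndarray:
    """Pseudo CoC from relative depth, with focus fixed at the image center."""
    h, w = depth.shape
    focus_depth = depth[h // 2, w // 2]
    deviation = np.abs(depth - focus_depth)
    peak = deviation.max()
    if peak == 0:
        # Flat depth map: everything lies on the focal plane.
        return np.zeros(depth.shape, dtype=np.float32)
    # Assumption: normalize by the largest deviation so max(CoC) == coc_max_px.
    return (deviation / peak * coc_max_px).astype(np.float32)
```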

### RendererNet input

RendererNet is a U-Net with **6 input channels** and **3 output channels** (RGB):

| Channel(s) | Content |
|------------|---------|
| 0–2 | RGB image, resized to 512×512 |
| 3 | f-stop map, normalized by `F_STOP_MAX` (22.0) |
| 4 | Focal length map, normalized by `FOCAL_LENGTH_MM_MAX` (200.0) |
| 5 | CoC map, clipped to [0, 25] px and normalized by `COC_PX_NORM` (25.0) |

The network output is resized back to the working resolution and blended with the original using a smoothstep weight derived from the CoC map and the **in-focus CoC threshold** slider.
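
The six-channel layout in the table can be assembled roughly as follows. This is a sketch under stated assumptions rather than the code in `app.py`: it assumes the RGB image is a float array in [0, 1] that has already been resized, that the camera parameters are broadcast as constant maps, and that the network expects a channels-first batch. `build_renderer_input` is a hypothetical name.

```python
import numpy as np

F_STOP_MAX = 22.0
FOCAL_LENGTH_MM_MAX = 200.0
COC_PX_NORM = 25.0


def build_renderer_input(
    rgb: np.ndarray,        # (H, W, 3) float, assumed to be in [0, 1]
    coc_px: np.ndarray,     # (H, W) CoC in pixels
    f_stop: float,
    focal_length_mm: float,
) -> np.ndarray:
    """Stack RGB, f-stop, focal length, and CoC into a (1, 6, H, W) array."""
    h, w, _ = rgb.shape
    f_map = np.full((h, w, 1), f_stop / F_STOP_MAX, dtype=np.float32)
    fl_map = np.full((h, w, 1), focal_length_mm / FOCAL_LENGTH_MM_MAX, dtype=np.float32)
    coc_map = (np.clip(coc_px, 0.0, COC_PX_NORM) / COC_PX_NORM).astype(np.float32)
    stacked = np.concatenate(
        [rgb.astype(np.float32), f_map, fl_map, coc_map[..., None]], axis=-1
    )
    # Channels-last (H, W, 6) -> batched channels-first (1, 6, H, W).
    return np.transpose(stacked, (2, 0, 1))[None]
```

The resulting array could be wrapped with `torch.from_numpy` before being passed to the network.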

### Blending

The blend weight uses a smoothstep on CoC values above the focus threshold, so transitions between sharp and blurred regions are gradual rather than hard-edged. Where CoC is near zero, the original pixels are kept; where CoC is large, the RendererNet output dominates.
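
A sketch of such a blend is given below. The ramp edges are assumptions: the lower edge stands in for the in-focus threshold slider, and the upper edge of 1.0 px is borrowed from the transition band used in the benchmark section, so the values in `app.py` may differ.

```python
import numpy as np


def smoothstep(edge0: float, edge1: float, x: np.ndarray) -> np.ndarray:
    """Hermite smoothstep: 0 below edge0, 1 above edge1, smooth in between."""
    t = np.clip((x - edge0) / (edge1 - edge0), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)


def blend_render(
    original: np.ndarray,          # (H, W, 3)
    render: np.ndarray,            # (H, W, 3), RendererNet output
    coc: np.ndarray,               # (H, W) CoC in pixels
    focus_threshold: float = 0.4,  # assumed slider value
    full_blur_coc: float = 1.0,    # assumed upper edge of the ramp
) -> np.ndarray:
    """Keep the original where CoC is small, the render where CoC is large."""
    weight = smoothstep(focus_threshold, full_blur_coc, coc)[..., None]
    return (1.0 - weight) * original + weight * render
```

With these edges, a pixel with CoC 0.7 px sits halfway up the ramp and receives an equal mix of the original and the render.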

## Models

### Depth Anything V2

- **Default checkpoint:** `depth-anything/Depth-Anything-V2-Base-hf`
- Loaded at startup via `AutoImageProcessor` and `AutoModelForDepthEstimation` from `transformers`.
- Returns relative (not metric) depth, normalized per image to [0, 1].
- Requires **torch** and **torchvision**.

Swap to a different variant via the `DEPTH_MODEL_ID` environment variable:

| Model ID | Trade-off |
|----------|-----------|
| `depth-anything/Depth-Anything-V2-Small-hf` | Faster, lower quality — good for CPU Spaces |
| `depth-anything/Depth-Anything-V2-Base-hf` | Default; balanced speed and quality |
| `depth-anything/Depth-Anything-V2-Large-hf` | Best quality, slowest |
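
The sketch below shows one way the model ID lookup, inference, and per-image normalization could fit together. It follows the standard `transformers` depth-estimation usage; the helper names are hypothetical, and for brevity it loads the model on every call, whereas the app loads it once at startup.

```python
import os

import numpy as np

DEFAULT_DEPTH_MODEL_ID = "depth-anything/Depth-Anything-V2-Base-hf"


def resolve_depth_model_id(env=None) -> str:
    """Use DEPTH_MODEL_ID when set and non-empty, otherwise the default."""
    env = os.environ if env is None else env
    return env.get("DEPTH_MODEL_ID") or DEFAULT_DEPTH_MODEL_ID


def normalize_depth(raw) -> np.ndarray:
    """Min-max normalize a raw depth prediction to [0, 1] per image."""
    raw = np.asarray(raw, dtype=np.float32)
    lo, hi = float(raw.min()), float(raw.max())
    if hi - lo < 1e-8:
        return np.zeros_like(raw)  # constant prediction
    return (raw - lo) / (hi - lo)


def estimate_depth(image, model_id=None) -> np.ndarray:
    """Relative depth for a PIL image, resized back to the image size."""
    # Heavy imports are deferred so the helpers above stay lightweight.
    import torch
    from transformers import AutoImageProcessor, AutoModelForDepthEstimation

    model_id = model_id or resolve_depth_model_id()
    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModelForDepthEstimation.from_pretrained(model_id).eval()
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        predicted = model(**inputs).predicted_depth  # (1, H', W')
    resized = torch.nn.functional.interpolate(
        predicted.unsqueeze(1),
        size=image.size[::-1],  # PIL size is (W, H); interpolate wants (H, W)
        mode="bicubic",
        align_corners=False,
    )
    return normalize_depth(resized[0, 0].cpu().numpy())
```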

### RendererNet

- A U-Net architecture (inlined in `app.py`) trained to render defocus given RGB + camera parameters + CoC.
- Weights are loaded from **`renderer/best_renderer.pth`** by default.
- Can alternatively be fetched from the Hugging Face Hub when `RENDERER_REPO_ID` is set.

## Using the demo

1. Upload an image (JPEG or PNG).
2. Adjust **f-stop** (0.95–22.0) — lower values produce stronger background blur.
3. Adjust **focal length** (4–200 mm) — longer focal lengths increase the shallow-DoF effect.
4. Tune **in-focus CoC threshold** — where a pixel's CoC is below this value, the original sharp pixel is kept and the NN render is suppressed.
5. Optionally expand **Gaussian baseline** to configure the comparison blur (CoC threshold and sigma).
6. Click **Render**.

If a `cache/` directory with example images exists, sample images appear in the Examples panel.

## Configuration

All settings below can be overridden with environment variables. On a Hugging Face Space, set them under **Settings → Repository secrets / Variables**, so no code edits are needed.

| Variable | Default | Description |
|----------|---------|-------------|
| `DEPTH_MODEL_ID` | `depth-anything/Depth-Anything-V2-Base-hf` | Hugging Face model ID for depth estimation |
| `RENDERER_LOCAL_PATH` | `renderer/best_renderer.pth` | Path to RendererNet weights on disk |
| `RENDERER_REPO_ID` | *(empty)* | Hugging Face repo to download weights from |
| `RENDERER_FILENAME` | `best_renderer.pth` | Filename within the Hub repo |
| `HF_TOKEN` | *(empty)* | Token for private Hub repos (only needed with `RENDERER_REPO_ID`) |
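
As an illustration of how the renderer variables might interact, the sketch below resolves a weights path from them. The precedence is an assumption (the Hub is used whenever `RENDERER_REPO_ID` is non-empty, otherwise the local path), and `resolve_renderer_weights` is a hypothetical helper; `app.py` may order these checks differently.

```python
import os


def resolve_renderer_weights(env=None) -> str:
    """Return a filesystem path to the RendererNet weights."""
    env = os.environ if env is None else env
    repo_id = env.get("RENDERER_REPO_ID", "").strip()
    if not repo_id:
        # No Hub repo configured: fall back to the weights on disk.
        return env.get("RENDERER_LOCAL_PATH", "renderer/best_renderer.pth")

    # Deferred import so the local-only path has no Hub dependency.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id,
        filename=env.get("RENDERER_FILENAME", "best_renderer.pth"),
        token=env.get("HF_TOKEN") or None,  # only needed for private repos
    )
```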

### Internal constants

These are fixed in code and must match RendererNet's training setup:

| Constant | Value | Purpose |
|----------|-------|---------|
| `F_STOP_MAX` | 22.0 | f-stop normalization divisor |
| `FOCAL_LENGTH_MM_MAX` | 200.0 | Focal length normalization divisor |
| `COC_PX_NORM` | 25.0 | CoC channel clip limit (px) and normalization divisor |
| `TARGET_SIZE` | 512 | RendererNet spatial resolution |
| `COC_MAX_PX` | 4.0 | Maximum pseudo CoC magnitude |
| `MAX_SIDE` | 1024 | Longest image side before inference |

## Local development

### Requirements

- Python 3.10+ (3.13 on Hugging Face Spaces)
- CUDA optional (falls back to CPU)

### Install and run

```bash
git clone https://github.com/<your-username>/Physically-Based-Portrait-Mode-Engine.git
cd Physically-Based-Portrait-Mode-Engine

pip install -r requirements.txt

# Ensure RendererNet weights are present
# (renderer/best_renderer.pth is included in the repo)

python app.py
```

Gradio will print a local URL (typically `http://127.0.0.1:7860`).

### Dependencies

```
gradio>=4.44.0
torch>=2.1.0
torchvision>=0.16.0
transformers>=4.45.0
huggingface_hub>=0.24.0
numpy>=1.26.0
pillow>=10.0.0
scikit-image>=0.22.0
matplotlib>=3.7.0
```

On first run, Depth Anything V2 weights are downloaded from the Hugging Face Hub (~370 MB for the Base model).

### Quantitative evaluation: edge artifacts

`quantitative-tests/benchmark_edge_artifacts.py` runs the same fixed f/1.2 pipeline as `prototype.py` on the four JPGs in `cache/` and compares **edge artifacts** against the original image for two compositing methods:

1. **Flat Gaussian background** — uniform blur pasted in wherever pseudo-CoC > 1 px (hard mask).
2. **CoC-weighted NN render** — RendererNet output blended back onto the original with a smoothstep weight (in-focus pixels stay untouched).

Metrics are computed in three regions where compositing artifacts show up most: **in-focus** pixels (CoC ≤ 0.4), the **transition band** (0.4 < CoC ≤ 1.0), and a **boundary ring** (±3 px around the Gaussian mask edge). Lower is better. Full per-image numbers are in `quantitative-tests/results/edge_artifact_benchmark.json`.

**Setup:** f/1.2, focal length 6.765 mm, Gaussian σ = 12 px, images resized to max side 768 px, focus at image center.

| Metric (avg. across 4 images) | Gaussian vs original | NN render vs original | NN improvement |
|---|---:|---:|---:|
| Boundary ring mean abs diff | 0.066 | 0.007 | **9.3×** |
| Boundary ring mean grad excess | 0.044 | 0.005 | **8.2×** |
| Global mean grad excess | 0.025 | 0.009 | **2.8×** |
| In-focus mean grad excess | 0.0008 | 0.00009 | **8.9×** |

**Takeaways:**

- The hard Gaussian mask produces strong halos at the focus/defocus boundary — boundary-ring pixel error averages **~0.066** vs **~0.007** for the NN (roughly **6–21×** better per image).
- In-focus regions stay clean for both methods (mean abs diff ≈ 0), but the NN leaks less spurious edge energy even there (**~9×** lower in-focus grad excess on average).
- The NN’s smoothstep blend removes most of the visible cut-line artifact; the flat Gaussian baseline is useful mainly as a simple comparison point, not as a production compositor.

Re-run the benchmark:

```bash
python quantitative-tests/benchmark_edge_artifacts.py
```
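
The three evaluation regions and the mean absolute difference could be computed along the lines below. This is a NumPy-only sketch with hypothetical function names, not the benchmark script itself: it assumes the boundary ring is the set of pixels within 3 px of the edge of the `CoC > 1` mask under a square neighborhood, and it leaves out the gradient-excess metric, whose definition is not given in this README.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2r+1) x (2r+1) square window."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def region_masks(coc, in_focus_max=0.4, transition_max=1.0, ring_px=3):
    """Boolean masks for the in-focus, transition, and boundary-ring regions."""
    blur_mask = coc > transition_max         # hard Gaussian mask (CoC > 1 px)
    grown = dilate(blur_mask, ring_px)       # mask expanded by ring_px
    shrunk = ~dilate(~blur_mask, ring_px)    # erosion via the complement
    return {
        "in_focus": coc <= in_focus_max,
        "transition": (coc > in_focus_max) & (coc <= transition_max),
        "boundary_ring": grown & ~shrunk,
    }


def mean_abs_diff(a: np.ndarray, b: np.ndarray, mask: np.ndarray) -> float:
    """Mean absolute difference between two images inside a region."""
    if not mask.any():
        return 0.0
    return float(np.abs(a - b)[mask].mean())
```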

## Project structure

```
.
├── app.py                  # Gradio app, RendererNet definition, full inference pipeline
├── requirements.txt        # Python dependencies
├── renderer/
│   └── best_renderer.pth   # Trained RendererNet weights
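├── cache/                  # Optional example images for Gradio Examples
└── quantitative-tests/
    ├── benchmark_edge_artifacts.py        # Edge-artifact benchmark
    └── results/
        └── edge_artifact_benchmark.json   # Per-image benchmark numbers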
```

## Limitations

- **Fixed focus point:** Focus is always the image center; there is no click-to-focus or subject detection.
- **Relative depth only:** Depth Anything V2 outputs relative depth, not metric distances, so the CoC map is a heuristic rather than a physically exact optical simulation.
- **Resolution cap:** Images are downscaled so the longest side is at most 1024 px to keep CPU inference responsive.
- **RendererNet runs at 512×512:** Fine detail may be softened; output is upsampled to the working resolution.
- **CPU Spaces are slow:** First inference after startup can take tens of seconds; consider the Small depth model or a GPU Space for faster turnaround.

## Acknowledgments

- [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2) — monocular depth estimation ([Lihe Yang et al.](https://huggingface.co/depth-anything/Depth-Anything-V2-Base-hf))
- [Hugging Face Transformers](https://github.com/huggingface/transformers) — model loading and inference
- [Gradio](https://gradio.app/) — web UI