---
title: Physically Based Portrait Mode Engine
emoji: 🐠
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.19.0
python_version: '3.13'
app_file: app.py
pinned: false
short_description: An improved portrait mode renderer for pictures.
---
# Physically-Based Portrait Mode Engine
A Gradio demo that simulates **shallow depth-of-field** (portrait mode) on ordinary photos. Instead of applying a flat Gaussian blur to the background, it estimates scene depth, builds a **circle-of-confusion (CoC)** map, and uses a trained neural renderer (**RendererNet**) to produce physically motivated defocus. In-focus regions are preserved by blending the render back onto the original image.
The app is designed to run as a [Hugging Face Space](https://huggingface.co/docs/hub/spaces) and is self-contained: model definitions and inference code live in a single `app.py` file.
## What it does
Given an input photograph and camera parameters (f-stop, focal length), the pipeline:
1. **Estimates relative depth** with [Depth Anything V2](https://huggingface.co/depth-anything/Depth-Anything-V2-Base-hf) via the Hugging Face `transformers` library.
2. **Builds a pseudo CoC map** by measuring how far each pixel's depth deviates from the depth at the **image center** (the assumed focus point). Pixels closer to the focus depth get a CoC near zero; pixels farther away get larger CoC values.
3. **Renders defocus with RendererNet**, a U-Net that takes the RGB image plus normalized f-stop, focal length, and CoC channels and outputs a blurred RGB image at the chosen aperture settings.
4. **Blends the NN render onto the original** using a smooth CoC-based weight so that in-focus areas (CoC below a threshold) remain untouched.
5. **Produces a non-NN baseline** for comparison: a flat Gaussian blur applied only where CoC exceeds a separate threshold.
Three outputs are shown:
| Output | Description |
|--------|-------------|
| **Rendered (NN)** | RendererNet defocus, blended with the original in in-focus regions |
| **Gaussian baseline** | Simple background blur for comparison against the learned renderer |
| **Pseudo CoC map** | Colorized circle-of-confusion visualization (inferno colormap) |
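The Gaussian baseline from step 5 can be sketched in NumPy as follows. This is an illustrative sketch, not the code in `app.py`: the helper names `gaussian_kernel_1d`, `gaussian_blur`, and `gaussian_baseline` are hypothetical, the 3-sigma kernel radius and edge padding are assumptions, and the default threshold (1 px) and sigma (12 px) are borrowed from the benchmark setup described later in this README.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma (assumed radius)."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur of an (H, W, C) float image with edge padding."""
    kernel = gaussian_kernel_1d(sigma)
    r = len(kernel) // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="edge")
    # Blur along the width axis, then along the height axis.
    rows = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 0, rows)

def gaussian_baseline(image, coc_px, coc_threshold=1.0, sigma=12.0):
    """Paste a uniformly blurred copy wherever CoC exceeds the threshold (hard mask)."""
    blurred = gaussian_blur(image, sigma)
    mask = (coc_px > coc_threshold)[..., None]
    return np.where(mask, blurred, image)
```

Because the mask is binary, the blurred and sharp regions meet at a hard edge, which is the artifact the CoC-weighted blend is meant to avoid.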
## Pipeline overview
```
Input image
    │
    ▼
Resize (longest side ≤ 1024 px)
    │
    ▼
Depth Anything V2 ──► relative depth [0, 1]
    │
    ▼
Pseudo CoC map (focus = image center, max 4 px)
    │
    ├──────────────────────────────┐
    ▼                              ▼
RendererNet (512×512)          Gaussian blur baseline
f-stop + focal length + CoC    (CoC > threshold)
    │                              │
    ▼                              │
CoC-weighted blend ◄───────────────┘
    │
    ▼
Final outputs (NN render, baseline, CoC map)
```
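The resize step at the top of the diagram amounts to a small size calculation, sketched below. `fit_longest_side` is a hypothetical helper, and the choice to leave smaller images at their original size (rather than upscaling them) is an assumption based on the "at most 1024 px" wording.

```python
MAX_SIDE = 1024  # value from the internal constants table

def fit_longest_side(width, height, max_side=MAX_SIDE):
    """Return (width, height) scaled so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # assumed: small images are not upscaled
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4000×3000 photo would be processed at 1024×768 under these assumptions.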
### Pseudo CoC
Because the demo has no interactive focus point, focus is fixed at the **center of the image** (`h // 2`, `w // 2`). The CoC at each pixel is proportional to the absolute difference between that pixel's relative depth and the depth at the focus point, scaled to a maximum of **4 px** (`COC_MAX_PX`).
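A minimal NumPy sketch of this idea is shown below. The helper name `pseudo_coc` is hypothetical, and the exact scaling is an assumption: here the largest depth deviation in the image is mapped to `COC_MAX_PX`, which is one way to read "scaled to a maximum of 4 px". The actual code may scale differently, for example by also factoring in the f-stop and focal length.

```python
import numpy as np

COC_MAX_PX = 4.0  # value from the internal constants table

def pseudo_coc(depth, coc_max_px=COC_MAX_PX):
    """Map a relative depth map in [0, 1] to a pseudo CoC map in pixels."""
    h, w = depth.shape
    focus_depth = depth[h // 2, w // 2]      # focus is fixed at the image center
    deviation = np.abs(depth - focus_depth)  # distance from the focus depth
    peak = deviation.max()
    if peak == 0:
        # Perfectly flat depth map: everything is in focus.
        return np.zeros_like(depth, dtype=np.float32)
    # Assumed scaling: the largest deviation maps to coc_max_px.
    return (deviation / peak * coc_max_px).astype(np.float32)
```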
### RendererNet input
RendererNet is a U-Net with **6 input channels** and **3 output channels** (RGB):
| Channel(s) | Content |
|------------|---------|
| 0–2 | RGB image, resized to 512×512 |
| 3 | f-stop map, normalized by `F_STOP_MAX` (22.0) |
| 4 | Focal length map, normalized by `FOCAL_LENGTH_MM_MAX` (200.0) |
| 5 | CoC map, clipped to [0, 25] px and normalized by `COC_PX_NORM` (25.0) |
The network output is resized back to the working resolution and blended with the original using a smoothstep weight derived from the CoC map and the **in-focus CoC threshold** slider.
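The channel layout in the table can be assembled as in the sketch below. `build_renderer_input` is a hypothetical helper, and several details are assumptions rather than facts from `app.py`: the inputs are taken to be already resized to 512×512, RGB values are taken to be floats in [0, 1], and the result uses a channel-first `(6, H, W)` layout.

```python
import numpy as np

# Normalization constants from the internal constants table.
F_STOP_MAX = 22.0
FOCAL_LENGTH_MM_MAX = 200.0
COC_PX_NORM = 25.0

def build_renderer_input(rgb, coc_px, f_stop, focal_length_mm):
    """Stack RGB, f-stop, focal length, and CoC into a (6, H, W) float32 array.

    rgb: (H, W, 3) floats in [0, 1] (assumed); coc_px: (H, W) CoC in pixels.
    """
    h, w, _ = rgb.shape
    f_stop_map = np.full((h, w), f_stop / F_STOP_MAX, dtype=np.float32)
    focal_map = np.full((h, w), focal_length_mm / FOCAL_LENGTH_MM_MAX, dtype=np.float32)
    coc_map = np.clip(coc_px, 0.0, COC_PX_NORM) / COC_PX_NORM
    channels = [rgb[..., 0], rgb[..., 1], rgb[..., 2], f_stop_map, focal_map, coc_map]
    return np.stack(channels, axis=0).astype(np.float32)
```

The camera parameters are broadcast to constant maps so that every pixel sees the same f-stop and focal length alongside its own CoC value.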
### Blending
The blend weight uses a smoothstep on CoC values above the focus threshold, so transitions between sharp and blurred regions are gradual rather than hard-edged. Where CoC is near zero, the original pixels are kept; where CoC is large, the RendererNet output dominates.
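A smoothstep blend of this kind can be sketched as follows. The function names and the `full_blur_coc` parameter are hypothetical; the defaults of 0.4 and 1.0 px mirror the in-focus and transition bounds used in the benchmark section, and the width of the transition in `app.py` may differ.

```python
import numpy as np

def smoothstep(edge0, edge1, x):
    """Standard smoothstep: 0 below edge0, 1 above edge1, smooth cubic in between."""
    t = np.clip((x - edge0) / (edge1 - edge0), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)

def blend(original, rendered, coc_px, focus_threshold=0.4, full_blur_coc=1.0):
    """Blend (H, W, 3) images with a per-pixel weight derived from the CoC map."""
    weight = smoothstep(focus_threshold, full_blur_coc, coc_px)[..., None]
    # weight = 0 keeps the original pixel; weight = 1 uses the render.
    return (1.0 - weight) * original + weight * rendered
```

Because the cubic has zero slope at both edges, the weight eases in and out of the transition, which is what keeps the boundary from showing a visible cut line.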
## Models
### Depth Anything V2
- **Default checkpoint:** `depth-anything/Depth-Anything-V2-Base-hf`
- Loaded at startup via `AutoImageProcessor` and `AutoModelForDepthEstimation` from `transformers`.
- Returns relative (not metric) depth, normalized per image to [0, 1].
- Requires **torch** and **torchvision**.
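Per-image normalization to [0, 1] can be done with a min-max rescale, as in the sketch below. Whether `app.py` uses exactly this form is an assumption, and `normalize_depth` is a hypothetical name.

```python
import numpy as np

def normalize_depth(raw):
    """Rescale a raw depth prediction to [0, 1] using its own min and max."""
    raw = np.asarray(raw, dtype=np.float32)
    lo, hi = raw.min(), raw.max()
    if hi - lo < 1e-8:
        # Constant prediction: avoid dividing by zero.
        return np.zeros_like(raw)
    return (raw - lo) / (hi - lo)
```

Since the range is recomputed for every image, the same normalized value can correspond to different real-world distances in different photos.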
Swap to a different variant via the `DEPTH_MODEL_ID` environment variable:
| Model ID | Trade-off |
|----------|-----------|
| `depth-anything/Depth-Anything-V2-Small-hf` | Faster, lower quality; good for CPU Spaces |
| `depth-anything/Depth-Anything-V2-Base-hf` | Default; balanced speed and quality |
| `depth-anything/Depth-Anything-V2-Large-hf` | Best quality, slowest |
### RendererNet
- A U-Net architecture (inlined in `app.py`) trained to render defocus given RGB + camera parameters + CoC.
- Weights are loaded from **`renderer/best_renderer.pth`** by default.
- Can alternatively be fetched from the Hugging Face Hub when `RENDERER_REPO_ID` is set.
## Using the demo
1. Upload an image (JPEG or PNG).
2. Adjust **f-stop** (0.95–22.0): lower values produce stronger background blur.
3. Adjust **focal length** (4–200 mm): longer focal lengths strengthen the shallow depth-of-field effect.
4. Tune **in-focus CoC threshold**: pixels with CoC below this value keep the original sharp image instead of the NN render.
5. Optionally expand **Gaussian baseline** to configure the comparison blur (CoC threshold and sigma).
6. Click **Render**.
If a `cache/` directory with example images exists, sample images appear in the Examples panel.
## Configuration
All settings below can be overridden with environment variables, which is useful for configuring a Hugging Face Space through **Settings → Repository secrets / Variables** without editing code.
| Variable | Default | Description |
|----------|---------|-------------|
| `DEPTH_MODEL_ID` | `depth-anything/Depth-Anything-V2-Base-hf` | Hugging Face model ID for depth estimation |
| `RENDERER_LOCAL_PATH` | `renderer/best_renderer.pth` | Path to RendererNet weights on disk |
| `RENDERER_REPO_ID` | *(empty)* | Hugging Face repo to download weights from |
| `RENDERER_FILENAME` | `best_renderer.pth` | Filename within the Hub repo |
| `HF_TOKEN` | *(empty)* | Token for private Hub repos (only needed with `RENDERER_REPO_ID`) |
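One way such overrides could be read is sketched below, using only the standard library. `load_settings` and `weights_source` are hypothetical helpers rather than names from `app.py`, and the rule that a non-empty `RENDERER_REPO_ID` takes precedence over the local path is an assumption.

```python
import os

# Defaults copied from the configuration table above.
DEFAULTS = {
    "DEPTH_MODEL_ID": "depth-anything/Depth-Anything-V2-Base-hf",
    "RENDERER_LOCAL_PATH": "renderer/best_renderer.pth",
    "RENDERER_REPO_ID": "",
    "RENDERER_FILENAME": "best_renderer.pth",
    "HF_TOKEN": "",
}

def load_settings(env=None):
    """Return each setting from the environment, falling back to its default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}

def weights_source(settings):
    """Assumed precedence: use the Hub when a repo ID is set, else the local file."""
    return "hub" if settings["RENDERER_REPO_ID"] else "local"
```

Passing a plain dictionary as `env` makes the lookup easy to exercise without touching the real process environment.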
### Internal constants
These are fixed in code and must match RendererNet's training setup:
| Constant | Value | Purpose |
|----------|-------|---------|
| `F_STOP_MAX` | 22.0 | f-stop normalization divisor |
| `FOCAL_LENGTH_MM_MAX` | 200.0 | Focal length normalization divisor |
| `COC_PX_NORM` | 25.0 | CoC channel clip limit and normalization divisor |
| `TARGET_SIZE` | 512 | RendererNet spatial resolution |
| `COC_MAX_PX` | 4.0 | Maximum pseudo CoC magnitude |
| `MAX_SIDE` | 1024 | Longest image side before inference |
## Local development
### Requirements
- Python 3.10+ (3.13 on Hugging Face Spaces)
- CUDA optional (falls back to CPU)
### Install and run
```bash
git clone https://github.com/<your-username>/Physically-Based-Portrait-Mode-Engine.git
cd Physically-Based-Portrait-Mode-Engine
pip install -r requirements.txt
# Ensure RendererNet weights are present
# (renderer/best_renderer.pth is included in the repo)
python app.py
```
Gradio will print a local URL (typically `http://127.0.0.1:7860`).
### Dependencies
```
gradio>=4.44.0
torch>=2.1.0
torchvision>=0.16.0
transformers>=4.45.0
huggingface_hub>=0.24.0
numpy>=1.26.0
pillow>=10.0.0
scikit-image>=0.22.0
matplotlib>=3.7.0
```
On first run, Depth Anything V2 weights are downloaded from the Hugging Face Hub (~370 MB for the Base model).
### Quantitative evaluation: edge artifacts
`quantitative-tests/benchmark_edge_artifacts.py` runs the same fixed f/1.2 pipeline as `prototype.py` on the four JPGs in `cache/` and compares **edge artifacts** against the original image for two compositing methods:
1. **Flat Gaussian background** β€” uniform blur pasted in wherever pseudo-CoC > 1 px (hard mask).
2. **CoC-weighted NN render** β€” RendererNet output blended back onto the original with a smoothstep weight (in-focus pixels stay untouched).
Metrics are computed in three regions where compositing artifacts show up most: **in-focus** pixels (CoC ≤ 0.4), the **transition band** (0.4 < CoC ≤ 1.0), and a **boundary ring** (±3 px around the Gaussian mask edge). Lower is better. Full per-image numbers are in `quantitative-tests/results/edge_artifact_benchmark.json`.
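The three regions and the mean absolute difference metric could be computed as in the sketch below. This is not the benchmark script itself: `dilate`, `region_masks`, and `mean_abs_diff` are hypothetical helpers, the ring is built with a square neighborhood as an assumption, and the gradient-excess metric is left out because its definition is not given here.

```python
import numpy as np

def dilate(mask, radius):
    """Binary dilation with a (2r+1) x (2r+1) square, built from shifted copies."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def region_masks(coc_px, in_focus_max=0.4, transition_max=1.0, ring_px=3):
    """Boolean masks for the in-focus, transition, and boundary-ring regions."""
    blur_mask = coc_px > transition_max  # where the Gaussian baseline pastes blur
    in_focus = coc_px <= in_focus_max
    transition = (coc_px > in_focus_max) & (coc_px <= transition_max)
    # Pixels within ring_px of both sides of the hard mask edge.
    ring = dilate(blur_mask, ring_px) & dilate(~blur_mask, ring_px)
    return {"in_focus": in_focus, "transition": transition, "boundary_ring": ring}

def mean_abs_diff(a, b, mask):
    """Mean absolute difference between two images inside a boolean mask."""
    if not mask.any():
        return 0.0
    return float(np.abs(a - b)[mask].mean())
```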
**Setup:** f/1.2, focal length 6.765 mm, Gaussian σ = 12 px, images resized to max side 768 px, focus at image center.
| Metric (avg. across 4 images) | Gaussian vs original | NN render vs original | NN improvement |
|---|---:|---:|---:|
| Boundary ring mean abs diff | 0.066 | 0.007 | **9.3×** |
| Boundary ring mean grad excess | 0.044 | 0.005 | **8.2×** |
| Global mean grad excess | 0.025 | 0.009 | **2.8×** |
| In-focus mean grad excess | 0.0008 | 0.00009 | **8.9×** |
**Takeaways:**
- The hard Gaussian mask produces strong halos at the focus/defocus boundary: boundary-ring pixel error averages **~0.066** vs **~0.007** for the NN (roughly **6–21×** better per image).
- In-focus regions stay clean for both methods (mean abs diff ≈ 0), but the NN leaks less spurious edge energy even there (**~9×** lower in-focus grad excess on average).
- The NN's smoothstep blend removes most of the visible cut-line artifact; the flat Gaussian baseline is useful mainly as a simple comparison point, not as a production compositor.
Re-run the benchmark:
```bash
python quantitative-tests/benchmark_edge_artifacts.py
```
## Project structure
```
.
├── app.py                  # Gradio app, RendererNet definition, full inference pipeline
├── requirements.txt        # Python dependencies
├── renderer/
│   └── best_renderer.pth   # Trained RendererNet weights
└── cache/                  # Optional example images for Gradio Examples
```
## Limitations
- **Fixed focus point:** Focus is always the image center; there is no click-to-focus or subject detection.
- **Relative depth only:** Depth Anything V2 outputs relative depth, defined only up to an unknown scale and shift, not metric distances, so the CoC map is a heuristic rather than a physically exact optical simulation.
- **Resolution cap:** Images are downscaled so the longest side is at most 1024 px to keep CPU inference responsive.
- **RendererNet runs at 512×512:** Fine detail may be softened; output is upsampled to the working resolution.
- **CPU Spaces are slow:** First inference after startup can take tens of seconds; consider the Small depth model or a GPU Space for faster turnaround.
## Acknowledgments
- [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2): monocular depth estimation ([Lihe Yang et al.](https://huggingface.co/depth-anything/Depth-Anything-V2-Base-hf))
- [Hugging Face Transformers](https://github.com/huggingface/transformers): model loading and inference
- [Gradio](https://gradio.app/): web UI