---
title: Physically Based Portrait Mode Engine
emoji: π
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.19.0
python_version: '3.13'
app_file: app.py
pinned: false
short_description: An improved portrait mode renderer for pictures.
---

# Physically-Based Portrait Mode Engine

A Gradio demo that simulates **shallow depth-of-field** (portrait mode) on ordinary photos. Instead of applying a flat Gaussian blur to the background, it estimates scene depth, builds a **circle-of-confusion (CoC)** map, and uses a trained neural renderer (**RendererNet**) to produce physically motivated defocus. In-focus regions are preserved by blending the render back onto the original image.

The app is designed to run as a [Hugging Face Space](https://huggingface.co/docs/hub/spaces) and is self-contained: model definitions and inference code live in a single `app.py` file.

## What it does

Given an input photograph and camera parameters (f-stop, focal length), the pipeline:

1. **Estimates relative depth** with [Depth Anything V2](https://huggingface.co/depth-anything/Depth-Anything-V2-Base-hf) via the Hugging Face `transformers` library.
2. **Builds a pseudo CoC map** by measuring how far each pixel's depth deviates from the depth at the **image center** (the assumed focus point). Pixels closer to the focus depth get a CoC near zero; pixels farther away get larger CoC values.
3. **Renders defocus with RendererNet**, a U-Net that takes the RGB image plus normalized f-stop, focal length, and CoC channels and outputs a blurred RGB image at the chosen aperture settings.
4. **Blends the NN render onto the original** using a smooth CoC-based weight so that in-focus areas (CoC below a threshold) remain untouched.
5. **Produces a non-NN baseline** for comparison: a flat Gaussian blur applied only where CoC exceeds a separate threshold.


Three outputs are shown:

| Output | Description |
|--------|-------------|
| **Rendered (NN)** | RendererNet defocus, with the original image preserved in in-focus regions |
| **Gaussian baseline** | Simple background blur for comparison against the learned renderer |
| **Pseudo CoC map** | Colorized circle-of-confusion visualization (inferno colormap) |

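
The Gaussian baseline listed above can be sketched with NumPy alone. This is a minimal illustration, not the code in `app.py`: the function names, the separable-kernel implementation, and the edge padding are assumptions, and the default threshold (1 px) and sigma (12 px) are simply borrowed from the benchmark setup described later in this README.

```python
import numpy as np


def gaussian_kernel_1d(sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel truncated at roughly 3 sigma."""
    radius = max(1, int(round(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of an (H, W, C) float image, edge-padded."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    out = img.astype(np.float64)
    for axis in (0, 1):  # blur rows, then columns
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out


def gaussian_baseline(img, coc_px, coc_threshold=1.0, sigma=12.0):
    """Paste a uniformly blurred copy wherever CoC exceeds the threshold.

    The mask is hard (no feathering), which is what makes this a baseline.
    """
    blurred = gaussian_blur(img, sigma)
    mask = (coc_px > coc_threshold)[..., None]  # broadcast over channels
    return np.where(mask, blurred, img)
```

Because the mask is binary, pixels on either side of the threshold switch abruptly between sharp and blurred, which is the cut-line artifact the learned renderer is compared against.
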
## Pipeline overview

```
Input image
    │
    ▼
Resize (longest side ≤ 1024 px)
    │
    ▼
Depth Anything V2 ──► relative depth [0, 1]
    │
    ▼
Pseudo CoC map (focus = image center, max 4 px)
    │
    ├──────────────────────────────┐
    ▼                              ▼
RendererNet (512×512)         Gaussian blur baseline
f-stop + focal length + CoC   (CoC > threshold)
    │                              │
    ▼                              │
CoC-weighted blend ────────────────┘
    │
    ▼
Final outputs (NN render, baseline, CoC map)
```

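
The resize step at the top of the diagram can be sketched as a small size calculation. The helper name is hypothetical, and the choice to leave smaller images untouched (rather than upscale them) is an assumption based on the README describing the step as a downscale.

```python
def fit_longest_side(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return a (width, height) whose longest side is at most max_side.

    Aspect ratio is preserved; images already within the limit are unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The returned size could then be passed to an image library's resize call, for example Pillow's `Image.resize`.
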
### Pseudo CoC

Because the demo has no interactive focus point, focus is fixed at the **center of the image** (`h // 2`, `w // 2`). The CoC at each pixel is proportional to the absolute difference between that pixel's relative depth and the depth at the focus point, scaled to a maximum of **4 px** (`COC_MAX_PX`).

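
A minimal sketch of this computation, assuming the depth map is already normalized to [0, 1]. The function name is hypothetical, and the scaling shown here (multiplying the depth difference by `COC_MAX_PX`) is only one reading of "scaled to a maximum of 4 px"; the app may instead rescale so that the largest deviation in the image maps exactly to 4 px.

```python
import numpy as np

COC_MAX_PX = 4.0  # value taken from the constants table in this README


def pseudo_coc(depth: np.ndarray, coc_max_px: float = COC_MAX_PX) -> np.ndarray:
    """Pseudo CoC in pixels from an (H, W) relative depth map in [0, 1].

    Focus is fixed at the image center, so the center pixel gets CoC = 0.
    """
    h, w = depth.shape
    focus_depth = depth[h // 2, w // 2]
    # Depth differences lie in [0, 1], so the result lies in [0, coc_max_px].
    return np.abs(depth - focus_depth) * coc_max_px
```

With this reading, a subject at the same relative depth as the image center stays at zero CoC regardless of where it sits in the frame.
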
### RendererNet input

RendererNet is a U-Net with **6 input channels** and **3 output channels** (RGB):

| Channel(s) | Content |
|------------|---------|
| 0–2 | RGB image, resized to 512×512 |
| 3 | f-stop map, normalized by `F_STOP_MAX` (22.0) |
| 4 | Focal length map, normalized by `FOCAL_LENGTH_MM_MAX` (200.0) |
| 5 | CoC map, clipped to [0, 25] px and normalized by `COC_PX_NORM` (25.0) |

The network output is resized back to the working resolution and blended with the original using a smoothstep weight derived from the CoC map and the **in-focus CoC threshold** slider.

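
The channel layout in the table can be assembled as follows. This is a sketch under stated assumptions: the function name is hypothetical, the tensor is laid out channel-first, the RGB values are assumed to be floats in [0, 1], and the inputs are assumed to be already resized to the network resolution. Only the constants and the channel order come from this README.

```python
import numpy as np

F_STOP_MAX = 22.0
FOCAL_LENGTH_MM_MAX = 200.0
COC_PX_NORM = 25.0


def build_renderer_input(rgb, coc_px, f_stop, focal_length_mm):
    """Stack RGB, f-stop, focal length, and CoC into a (6, H, W) float32 array.

    rgb:    (H, W, 3) float image, assumed to be in [0, 1]
    coc_px: (H, W) CoC map in pixels
    """
    h, w, _ = rgb.shape
    # Scalar camera parameters become constant-valued maps.
    f_stop_map = np.full((h, w), f_stop / F_STOP_MAX, dtype=np.float32)
    focal_map = np.full((h, w), focal_length_mm / FOCAL_LENGTH_MM_MAX, dtype=np.float32)
    coc_map = np.clip(coc_px, 0.0, COC_PX_NORM) / COC_PX_NORM
    channels = [rgb[..., 0], rgb[..., 1], rgb[..., 2], f_stop_map, focal_map, coc_map]
    return np.stack(channels, axis=0).astype(np.float32)
```

Adding a leading batch axis (for example `x[None]`) would give the `(1, 6, H, W)` shape a PyTorch U-Net typically expects.
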
### Blending

The blend weight uses a smoothstep on CoC values above the focus threshold, so transitions between sharp and blurred regions are gradual rather than hard-edged. Where CoC is near zero, the original pixels are kept; where CoC is large, the RendererNet output dominates.

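
A minimal sketch of such a blend. The smoothstep polynomial is the standard one, but the width of the transition is not stated in this README, so `ramp_px` is a hypothetical parameter, as are the function names.

```python
import numpy as np


def smoothstep(edge0: float, edge1: float, x: np.ndarray) -> np.ndarray:
    """Standard smoothstep: 0 below edge0, 1 above edge1, smooth in between."""
    t = np.clip((x - edge0) / (edge1 - edge0), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)


def blend_render(original, rendered, coc_px, focus_threshold_px, ramp_px=1.0):
    """Blend the NN render onto the original with a CoC-driven weight.

    original, rendered: (H, W, 3) float images; coc_px: (H, W) CoC map.
    ramp_px (assumed) is how far above the threshold the weight reaches 1.
    """
    weight = smoothstep(focus_threshold_px, focus_threshold_px + ramp_px, coc_px)
    weight = weight[..., None]  # broadcast over the color channels
    return (1.0 - weight) * original + weight * rendered
```

Pixels at or below the threshold get weight 0 and keep their original values exactly; pixels at least `ramp_px` above it take the rendered values.
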
## Models

### Depth Anything V2

- **Default checkpoint:** `depth-anything/Depth-Anything-V2-Base-hf`
- Loaded at startup via `AutoImageProcessor` and `AutoModelForDepthEstimation` from `transformers`.
- Returns relative (not metric) depth, normalized per image to [0, 1].
- Requires **torch** and **torchvision**.

Swap to a different variant via the `DEPTH_MODEL_ID` environment variable:

| Model ID | Trade-off |
|----------|-----------|
| `depth-anything/Depth-Anything-V2-Small-hf` | Faster, lower quality; good for CPU Spaces |
| `depth-anything/Depth-Anything-V2-Base-hf` | Default; balanced speed and quality |
| `depth-anything/Depth-Anything-V2-Large-hf` | Best quality, slowest |

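
The sketch below shows one way the depth step could look with the `transformers` classes named above. It is not the code from `app.py`: the function names are hypothetical, min-max scaling is an assumed reading of "normalized per image", the input is assumed to be a Pillow RGB image, and the model is loaded inside the function only to keep the example short (the app loads it once at startup).

```python
import os

import numpy as np

DEPTH_MODEL_ID = os.environ.get(
    "DEPTH_MODEL_ID", "depth-anything/Depth-Anything-V2-Base-hf"
)


def normalize_depth(depth: np.ndarray) -> np.ndarray:
    """Min-max scale a depth map to [0, 1] (assumed normalization)."""
    d = depth.astype(np.float32)
    span = float(d.max() - d.min())
    if span < 1e-8:  # flat map: avoid dividing by zero
        return np.zeros_like(d)
    return (d - d.min()) / span


def estimate_depth(image, model_id: str = DEPTH_MODEL_ID) -> np.ndarray:
    """Relative depth in [0, 1] at the input image's resolution.

    `image` is assumed to be a PIL.Image in RGB mode.
    """
    # Heavy imports are deferred so the helper above works without them.
    import torch
    from transformers import AutoImageProcessor, AutoModelForDepthEstimation

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModelForDepthEstimation.from_pretrained(model_id).eval()
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        predicted = model(**inputs).predicted_depth  # shape (1, h, w)
    # Resize the prediction back to (height, width) of the input image.
    resized = torch.nn.functional.interpolate(
        predicted.unsqueeze(1),
        size=image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )
    return normalize_depth(resized[0, 0].cpu().numpy())
```

Setting `DEPTH_MODEL_ID` before the process starts is enough to switch checkpoints, since the default argument is read from the environment at import time.
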
### RendererNet

- A U-Net architecture (inlined in `app.py`) trained to render defocus given RGB + camera parameters + CoC.
- Weights are loaded from **`renderer/best_renderer.pth`** by default.
- Can alternatively be fetched from the Hugging Face Hub when `RENDERER_REPO_ID` is set.

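
The weight lookup could be sketched as below, using the environment variables documented in the Configuration section. The function name is hypothetical, and the precedence (Hub download whenever `RENDERER_REPO_ID` is non-empty, local path otherwise) is an assumption; `app.py` may order these checks differently.

```python
import os
from collections.abc import Mapping


def resolve_renderer_weights(env: Mapping[str, str] = os.environ) -> str:
    """Return a filesystem path to the RendererNet weights.

    Assumed precedence: Hub repo if RENDERER_REPO_ID is set, else local path.
    """
    repo_id = env.get("RENDERER_REPO_ID", "")
    if repo_id:
        # Imported lazily so local-only setups do not need network access.
        from huggingface_hub import hf_hub_download

        return hf_hub_download(
            repo_id=repo_id,
            filename=env.get("RENDERER_FILENAME", "best_renderer.pth"),
            token=env.get("HF_TOKEN") or None,  # only needed for private repos
        )
    return env.get("RENDERER_LOCAL_PATH", "renderer/best_renderer.pth")
```

The returned path would typically be handed to `torch.load(path, map_location="cpu")` before loading the state dict into the model.
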
## Using the demo

1. Upload an image (JPEG or PNG).
2. Adjust **f-stop** (0.95–22.0): lower values produce stronger background blur.
3. Adjust **focal length** (4–200 mm): longer focal lengths increase the shallow-DoF effect.
4. Tune **in-focus CoC threshold**: where CoC is below this value, the NN render is suppressed and the original pixels stay sharp.
5. Optionally expand **Gaussian baseline** to configure the comparison blur (CoC threshold and sigma).
6. Click **Render**.

If a `cache/` directory with example images exists, sample images appear in the Examples panel.

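
For intuition about steps 2 and 3, the standard thin-lens circle-of-confusion formula shows why a lower f-stop and a longer focal length both enlarge the blur. This is background optics only, not what the demo computes: the app works from relative depth and a learned renderer. The function and parameter names are illustrative.

```python
def thin_lens_coc_mm(
    focal_length_mm: float,
    f_stop: float,
    focus_dist_mm: float,
    subject_dist_mm: float,
) -> float:
    """Blur-circle diameter on the sensor (mm) under the thin-lens model.

    c = (f / N) * f / (s_f - f) * |s - s_f| / s
    where f is focal length, N the f-stop, s_f the focus distance,
    and s the distance of the point being imaged.
    """
    if focus_dist_mm <= focal_length_mm:
        raise ValueError("focus distance must exceed the focal length")
    aperture_mm = focal_length_mm / f_stop  # entrance pupil diameter
    magnification = focal_length_mm / (focus_dist_mm - focal_length_mm)
    defocus = abs(subject_dist_mm - focus_dist_mm) / subject_dist_mm
    return aperture_mm * magnification * defocus
```

Halving the f-stop doubles the result, while the focal length enters roughly quadratically, which matches the direction of both sliders.
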
## Configuration

All settings below can be overridden with environment variables. On a Hugging Face Space, set them under **Settings → Repository secrets / Variables**, so no code edits are needed.

| Variable | Default | Description |
|----------|---------|-------------|
| `DEPTH_MODEL_ID` | `depth-anything/Depth-Anything-V2-Base-hf` | Hugging Face model ID for depth estimation |
| `RENDERER_LOCAL_PATH` | `renderer/best_renderer.pth` | Path to RendererNet weights on disk |
| `RENDERER_REPO_ID` | *(empty)* | Hugging Face repo to download weights from |
| `RENDERER_FILENAME` | `best_renderer.pth` | Filename within the Hub repo |
| `HF_TOKEN` | *(empty)* | Token for private Hub repos (only needed with `RENDERER_REPO_ID`) |

### Internal constants

These are fixed in code and must match RendererNet's training setup:

| Constant | Value | Purpose |
|----------|-------|---------|
| `F_STOP_MAX` | 22.0 | f-stop normalization divisor |
| `FOCAL_LENGTH_MM_MAX` | 200.0 | Focal length normalization divisor |
| `COC_PX_NORM` | 25.0 | CoC channel clip limit and normalization divisor (px) |
| `TARGET_SIZE` | 512 | RendererNet spatial resolution (px) |
| `COC_MAX_PX` | 4.0 | Maximum pseudo CoC magnitude (px) |
| `MAX_SIDE` | 1024 | Longest image side before inference (px) |

## Local development

### Requirements

- Python 3.10+ (3.13 on Hugging Face Spaces)
- CUDA optional (falls back to CPU)

### Install and run

```bash
git clone https://github.com/<your-username>/Physically-Based-Portrait-Mode-Engine.git
cd Physically-Based-Portrait-Mode-Engine

pip install -r requirements.txt

# Ensure RendererNet weights are present
# (renderer/best_renderer.pth is included in the repo)

python app.py
```

Gradio will print a local URL (typically `http://127.0.0.1:7860`).

### Dependencies

```
gradio>=4.44.0
torch>=2.1.0
torchvision>=0.16.0
transformers>=4.45.0
huggingface_hub>=0.24.0
numpy>=1.26.0
pillow>=10.0.0
scikit-image>=0.22.0
matplotlib>=3.7.0
```

On first run, Depth Anything V2 weights are downloaded from the Hugging Face Hub (~370 MB for the Base model).

### Quantitative evaluation: edge artifacts

`quantitative-tests/benchmark_edge_artifacts.py` runs the same fixed f/1.2 pipeline as `prototype.py` on the four JPGs in `cache/` and compares **edge artifacts** against the original image for two compositing methods:

1. **Flat Gaussian background:** uniform blur pasted in wherever pseudo-CoC > 1 px (hard mask).
2. **CoC-weighted NN render:** RendererNet output blended back onto the original with a smoothstep weight (in-focus pixels stay untouched).

Metrics are computed in three regions where compositing artifacts show up most: **in-focus** pixels (CoC ≤ 0.4), the **transition band** (0.4 < CoC ≤ 1.0), and a **boundary ring** (±3 px around the Gaussian mask edge). Lower is better. Full per-image numbers are in `quantitative-tests/results/edge_artifact_benchmark.json`.

**Setup:** f/1.2, focal length 6.765 mm, Gaussian σ = 12 px, images resized to max side 768 px, focus at image center.

| Metric (avg. across 4 images) | Gaussian vs original | NN render vs original | NN improvement |
|---|---:|---:|---:|
| Boundary ring mean abs diff | 0.066 | 0.007 | **9.3×** |
| Boundary ring mean grad excess | 0.044 | 0.005 | **8.2×** |
| Global mean grad excess | 0.025 | 0.009 | **2.8×** |
| In-focus mean grad excess | 0.0008 | 0.00009 | **8.9×** |

**Takeaways:**

- The hard Gaussian mask produces strong halos at the focus/defocus boundary: boundary-ring pixel error averages **~0.066** vs **~0.007** for the NN (roughly **6–21×** better per image).
- In-focus regions stay clean for both methods (mean abs diff ≈ 0), but the NN leaks less spurious edge energy even there (**~9×** lower in-focus grad excess on average).
- The NN's smoothstep blend removes most of the visible cut-line artifact; the flat Gaussian baseline is useful mainly as a simple comparison point, not as a production compositor.

Re-run the benchmark:

```bash
python quantitative-tests/benchmark_edge_artifacts.py
```

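
To make the metric names concrete, here is a rough sketch of how region masks and the two error measures could be computed on grayscale arrays. The CoC thresholds (0.4 and 1.0) come from the description above, but everything else is an assumption: the function names are hypothetical, and "grad excess" is read here as the gradient magnitude a result has beyond the original's, floored at zero. The benchmark script's actual definitions may differ.

```python
import numpy as np


def region_masks(coc_px, in_focus_max=0.4, transition_max=1.0):
    """Boolean masks for the in-focus region and the transition band."""
    in_focus = coc_px <= in_focus_max
    transition = (coc_px > in_focus_max) & (coc_px <= transition_max)
    return in_focus, transition


def mean_abs_diff(result, original, mask) -> float:
    """Mean absolute pixel difference inside the mask (0.0 if mask is empty)."""
    if not mask.any():
        return 0.0
    return float(np.abs(result - original)[mask].mean())


def grad_magnitude(gray: np.ndarray) -> np.ndarray:
    """Per-pixel gradient magnitude of an (H, W) image via finite differences."""
    gy, gx = np.gradient(gray)
    return np.hypot(gx, gy)


def mean_grad_excess(result, original, mask) -> float:
    """Assumed definition: average edge energy added relative to the original."""
    if not mask.any():
        return 0.0
    excess = np.maximum(grad_magnitude(result) - grad_magnitude(original), 0.0)
    return float(excess[mask].mean())
```

Under this reading, a compositing step that introduces a visible seam raises the gradient excess near the seam, while a result that only smooths the image contributes nothing.
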
## Project structure

```
.
├── app.py                  # Gradio app, RendererNet definition, full inference pipeline
├── requirements.txt        # Python dependencies
├── renderer/
│   └── best_renderer.pth   # Trained RendererNet weights
└── cache/                  # Optional example images for Gradio Examples
```

## Limitations

- **Fixed focus point:** Focus is always the image center; there is no click-to-focus or subject detection.
- **Relative depth only:** Depth Anything V2 outputs ordinal depth, not metric distances, so the CoC map is a heuristic rather than a physically exact optical simulation.
- **Resolution cap:** Images are downscaled so the longest side is at most 1024 px to keep CPU inference responsive.
- **RendererNet runs at 512×512:** Fine detail may be softened; output is upsampled to the working resolution.
- **CPU Spaces are slow:** First inference after startup can take tens of seconds; consider the Small depth model or a GPU Space for faster turnaround.

## Acknowledgments

- [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2): monocular depth estimation ([Lihe Yang et al.](https://huggingface.co/depth-anything/Depth-Anything-V2-Base-hf))
- [Hugging Face Transformers](https://github.com/huggingface/transformers): model loading and inference
- [Gradio](https://gradio.app/): web UI