---
title: Physically Based Portrait Mode Engine
emoji: 🐠
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.19.0
python_version: '3.13'
app_file: app.py
pinned: false
short_description: An improved portrait mode renderer for pictures.
---

# Physically-Based Portrait Mode Engine

A Gradio demo that simulates **shallow depth-of-field** (portrait mode) on ordinary photos. Instead of applying a flat Gaussian blur to the background, it estimates scene depth, builds a **circle-of-confusion (CoC)** map, and uses a trained neural renderer (**RendererNet**) to produce physically motivated defocus. In-focus regions are preserved by blending the render back onto the original image.

The app is designed to run as a [Hugging Face Space](https://huggingface.co/docs/hub/spaces) and is self-contained: model definitions and inference code live in a single `app.py` file.

## What it does

Given an input photograph and camera parameters (f-stop, focal length), the pipeline:

1. **Estimates relative depth** with [Depth Anything V2](https://huggingface.co/depth-anything/Depth-Anything-V2-Base-hf) via the Hugging Face `transformers` library.
2. **Builds a pseudo CoC map** by measuring how far each pixel's depth deviates from the depth at the **image center** (the assumed focus point). Pixels closer to the focus depth get a CoC near zero; pixels farther away get larger CoC values.
3. **Renders defocus with RendererNet**, a U-Net that takes the RGB image plus normalized f-stop, focal length, and CoC channels and outputs a blurred RGB image at the chosen aperture settings.
4. **Blends the NN render onto the original** using a smooth CoC-based weight so that in-focus areas (CoC below a threshold) remain untouched.
5. **Produces a non-NN baseline** for comparison: a flat Gaussian blur applied only where CoC exceeds a separate threshold.

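
Step 2 above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `app.py`: the function name `pseudo_coc` is hypothetical, and the rescaling (dividing by the largest depth deviation so the farthest pixel reaches `COC_MAX_PX`) is an assumption about how "scaled to a maximum of 4 px" is applied.

```python
import numpy as np

COC_MAX_PX = 4.0  # maximum pseudo CoC in pixels, as documented in this README


def pseudo_coc(depth: np.ndarray, coc_max_px: float = COC_MAX_PX) -> np.ndarray:
    """Map a relative depth map of shape (H, W) in [0, 1] to a pseudo CoC map in pixels."""
    h, w = depth.shape
    focus_depth = depth[h // 2, w // 2]  # focus is fixed at the image center
    diff = np.abs(depth - focus_depth)   # deviation from the focus depth
    peak = float(diff.max())
    if peak == 0.0:
        # Flat depth map: every pixel sits at the focus depth.
        return np.zeros_like(depth, dtype=np.float32)
    # Assumption: rescale so the largest deviation maps exactly to coc_max_px.
    return (diff / peak * coc_max_px).astype(np.float32)
```

With this scaling the center pixel always has a CoC of zero, and the pixel whose depth differs most from the center gets the full 4 px.
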
Three outputs are shown:

| Output | Description |
|--------|-------------|
| **Rendered (NN)** | RendererNet defocus, blended with the original in in-focus regions |
| **Gaussian baseline** | Simple background blur for comparison against the learned renderer |
| **Pseudo CoC map** | Colorized circle-of-confusion visualization (inferno colormap) |

## Pipeline overview

```
Input image
    │
    ▼
Resize (longest side ≤ 1024 px)
    │
    ▼
Depth Anything V2 ──► relative depth [0, 1]
    │
    ▼
Pseudo CoC map (focus = image center, max 4 px)
    │
    ├──────────────────────────────┐
    ▼                              ▼
RendererNet (512×512)         Gaussian blur baseline
f-stop + focal length + CoC   (CoC > threshold)
    │                              │
    ▼                              │
CoC-weighted blend ◄───────────────┘
    │
    ▼
Final outputs (NN render, baseline, CoC map)
```

### Pseudo CoC

Because the demo has no interactive focus point, focus is fixed at the **center of the image** (`h // 2`, `w // 2`). The CoC at each pixel is proportional to the absolute difference between that pixel's relative depth and the depth at the focus point, scaled to a maximum of **4 px** (`COC_MAX_PX`).

### RendererNet input

RendererNet is a U-Net with **6 input channels** and **3 output channels** (RGB):

| Channel(s) | Content |
|------------|---------|
| 0–2 | RGB image, resized to 512×512 |
| 3 | f-stop map, normalized by `F_STOP_MAX` (22.0) |
| 4 | Focal length map, normalized by `FOCAL_LENGTH_MM_MAX` (200.0) |
| 5 | CoC map, clipped to [0, 25] px and normalized by `COC_PX_NORM` (25.0) |

The network output is resized back to the working resolution and blended with the original using a smoothstep weight derived from the CoC map and the **in-focus CoC threshold** slider.

### Blending

The blend weight uses a smoothstep on CoC values above the focus threshold, so transitions between sharp and blurred regions are gradual rather than hard-edged. Where CoC is near zero, the original pixels are kept; where CoC is large, the RendererNet output dominates.

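
The input layout and blend described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `app.py`: the function names `pack_renderer_input` and `coc_blend` are hypothetical, the 512×512 resize is omitted, and the `width` parameter (how many pixels of CoC the smoothstep ramp spans above the threshold) is an assumption, since this README does not state the upper edge of the ramp.

```python
import numpy as np

# Normalization constants as documented in this README.
F_STOP_MAX = 22.0
FOCAL_LENGTH_MM_MAX = 200.0
COC_PX_NORM = 25.0


def pack_renderer_input(rgb, f_stop, focal_mm, coc_px):
    """Stack RGB (H, W, 3 in [0, 1]) with f-stop, focal length, and CoC planes into (6, H, W)."""
    h, w, _ = rgb.shape
    planes = [
        rgb.transpose(2, 0, 1),                                  # channels 0-2: RGB
        np.full((1, h, w), f_stop / F_STOP_MAX),                 # channel 3: f-stop
        np.full((1, h, w), focal_mm / FOCAL_LENGTH_MM_MAX),      # channel 4: focal length
        (np.clip(coc_px, 0.0, COC_PX_NORM) / COC_PX_NORM)[None], # channel 5: CoC
    ]
    return np.concatenate(planes, axis=0).astype(np.float32)


def coc_blend(original, rendered, coc_px, threshold, width=1.0):
    """Blend `rendered` over `original` with a smoothstep weight on CoC above `threshold`."""
    # Assumption: the ramp goes from 0 at `threshold` to 1 at `threshold + width`.
    t = np.clip((coc_px - threshold) / width, 0.0, 1.0)
    weight = (t * t * (3.0 - 2.0 * t))[..., None]  # smoothstep, broadcast over RGB
    return (1.0 - weight) * original + weight * rendered
```

Pixels at or below the threshold get a weight of exactly zero, so they are copied from the original unchanged; the cubic ramp has zero slope at both ends, which is what avoids a visible seam.
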
## Models

### Depth Anything V2

- **Default checkpoint:** `depth-anything/Depth-Anything-V2-Base-hf`
- Loaded at startup via `AutoImageProcessor` and `AutoModelForDepthEstimation` from `transformers`.
- Returns relative (not metric) depth, normalized per image to [0, 1].
- Requires **torch** and **torchvision**.

Swap to a different variant via the `DEPTH_MODEL_ID` environment variable:

| Model ID | Trade-off |
|----------|-----------|
| `depth-anything/Depth-Anything-V2-Small-hf` | Faster, lower quality; good for CPU Spaces |
| `depth-anything/Depth-Anything-V2-Base-hf` | Default; balanced speed and quality |
| `depth-anything/Depth-Anything-V2-Large-hf` | Best quality, slowest |

### RendererNet

- A U-Net architecture (inlined in `app.py`) trained to render defocus given RGB + camera parameters + CoC.
- Weights are loaded from **`renderer/best_renderer.pth`** by default.
- Can alternatively be fetched from the Hugging Face Hub when `RENDERER_REPO_ID` is set.

## Using the demo

1. Upload an image (JPEG or PNG).
2. Adjust **f-stop** (0.95–22.0): lower values produce stronger background blur.
3. Adjust **focal length** (4–200 mm): longer focal lengths increase the shallow-DoF effect.
4. Tune **in-focus CoC threshold**: where a pixel's CoC is below this value, the NN render is suppressed and the original pixel stays sharp.
5. Optionally expand **Gaussian baseline** to configure the comparison blur (CoC threshold and sigma).
6. Click **Render**.

If a `cache/` directory with example images exists, sample images appear in the Examples panel.

## Configuration

All settings below can be overridden with environment variables. On a Hugging Face Space, set them under **Settings → Repository secrets / Variables**; no code changes are needed.

| Variable | Default | Description |
|----------|---------|-------------|
| `DEPTH_MODEL_ID` | `depth-anything/Depth-Anything-V2-Base-hf` | Hugging Face model ID for depth estimation |
| `RENDERER_LOCAL_PATH` | `renderer/best_renderer.pth` | Path to RendererNet weights on disk |
| `RENDERER_REPO_ID` | *(empty)* | Hugging Face repo to download weights from |
| `RENDERER_FILENAME` | `best_renderer.pth` | Filename within the Hub repo |
| `HF_TOKEN` | *(empty)* | Token for private Hub repos (only needed with `RENDERER_REPO_ID`) |

### Internal constants

These are fixed in code and must match RendererNet's training setup:

| Constant | Value | Purpose |
|----------|-------|---------|
| `F_STOP_MAX` | 22.0 | f-stop normalization divisor |
| `FOCAL_LENGTH_MM_MAX` | 200.0 | Focal length normalization divisor |
| `COC_PX_NORM` | 25.0 | CoC channel clip limit and normalization divisor (px) |
| `TARGET_SIZE` | 512 | RendererNet spatial resolution (px) |
| `COC_MAX_PX` | 4.0 | Maximum pseudo CoC magnitude (px) |
| `MAX_SIDE` | 1024 | Longest image side before inference (px) |

## Local development

### Requirements

- Python 3.10+ (3.13 on Hugging Face Spaces)
- CUDA optional (falls back to CPU)

### Install and run

```bash
git clone https://github.com//Physically-Based-Portrait-Mode-Engine.git
cd Physically-Based-Portrait-Mode-Engine
pip install -r requirements.txt

# Ensure RendererNet weights are present
# (renderer/best_renderer.pth is included in the repo)
python app.py
```

Gradio will print a local URL (typically `http://127.0.0.1:7860`).

### Dependencies

```
gradio>=4.44.0
torch>=2.1.0
torchvision>=0.16.0
transformers>=4.45.0
huggingface_hub>=0.24.0
numpy>=1.26.0
pillow>=10.0.0
scikit-image>=0.22.0
matplotlib>=3.7.0
```

On first run, Depth Anything V2 weights are downloaded from the Hugging Face Hub (~370 MB for the Base model).

### Quantitative evaluation: edge artifacts

`quantitative-tests/benchmark_edge_artifacts.py` runs the same fixed f/1.2 pipeline as `prototype.py` on the four JPGs in `cache/` and compares **edge artifacts** against the original image for two compositing methods:

1. **Flat Gaussian background**: uniform blur pasted in wherever pseudo-CoC > 1 px (hard mask).
2. **CoC-weighted NN render**: RendererNet output blended back onto the original with a smoothstep weight (in-focus pixels stay untouched).

Metrics are computed in three regions where compositing artifacts show up most: **in-focus** pixels (CoC ≤ 0.4), the **transition band** (0.4 < CoC ≤ 1.0), and a **boundary ring** (±3 px around the Gaussian mask edge). Lower is better. Full per-image numbers are in `quantitative-tests/results/edge_artifact_benchmark.json`.

**Setup:** f/1.2, focal length 6.765 mm, Gaussian σ = 12 px, images resized to max side 768 px, focus at image center.

| Metric (avg. across 4 images) | Gaussian vs original | NN render vs original | NN improvement |
|---|---:|---:|---:|
| Boundary ring mean abs diff | 0.066 | 0.007 | **9.3×** |
| Boundary ring mean grad excess | 0.044 | 0.005 | **8.2×** |
| Global mean grad excess | 0.025 | 0.009 | **2.8×** |
| In-focus mean grad excess | 0.0008 | 0.00009 | **8.9×** |

**Takeaways:**

- The hard Gaussian mask produces strong halos at the focus/defocus boundary: boundary-ring pixel error averages **~0.066** vs **~0.007** for the NN (roughly **6–21×** better per image).
- In-focus regions stay clean for both methods (mean abs diff ≈ 0), but the NN leaks less spurious edge energy even there (**~9×** lower in-focus grad excess on average).
- The NN's smoothstep blend removes most of the visible cut-line artifact; the flat Gaussian baseline is useful mainly as a simple comparison point, not as a production compositor.

Re-run the benchmark:

```bash
python quantitative-tests/benchmark_edge_artifacts.py
```

## Project structure

```
.
├── app.py                  # Gradio app, RendererNet definition, full inference pipeline
├── requirements.txt        # Python dependencies
├── renderer/
│   └── best_renderer.pth   # Trained RendererNet weights
└── cache/                  # Optional example images for Gradio Examples
```

## Limitations

- **Fixed focus point:** Focus is always the image center; there is no click-to-focus or subject detection.
- **Relative depth only:** Depth Anything V2 outputs ordinal depth, not metric distances, so the CoC map is a heuristic rather than a physically exact optical simulation.
- **Resolution cap:** Images are downscaled so the longest side is at most 1024 px to keep CPU inference responsive.
- **RendererNet runs at 512×512:** Fine detail may be softened; output is upsampled to the working resolution.
- **CPU Spaces are slow:** First inference after startup can take tens of seconds; consider the Small depth model or a GPU Space for faster turnaround.

## Acknowledgments

- [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2): monocular depth estimation ([Lihe Yang et al.](https://huggingface.co/depth-anything/Depth-Anything-V2-Base-hf))
- [Hugging Face Transformers](https://github.com/huggingface/transformers): model loading and inference
- [Gradio](https://gradio.app/): web UI