---
license: apple-amlr
library_name: ml-sharp
pipeline_tag: image-to-3d
base_model: apple/Sharp
tags:
- coreml
- monocular-view-synthesis
- gaussian-splatting
---
# Sharp Monocular View Synthesis in Less Than a Second (Core ML Edition)
[Project Page](https://apple.github.io/ml-sharp/)
[arXiv:2512.10685](https://arxiv.org/abs/2512.10685)
This software project is a community contribution and is not affiliated with the original research paper:
> _Sharp Monocular View Synthesis in Less Than a Second_ by _Lars Mescheder, Wei Dong, Shiwei Li, Xuyang Bai, Marcel Santos, Peiyun Hu, Bruno Lecouat, Mingmin Zhen, Amaël Delaunoy, Tian Fang, Yanghai Tsin, Stephan Richter and Vladlen Koltun_.
> We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements.
#### This release includes a fully validated **Core ML (.mlpackage)** version of SHARP, optimized for CPU, GPU, and Neural Engine inference on macOS and iOS.

Rendered using [Splat Viewer](https://huggingface.co/spaces/pearsonkyle/Gaussian-Splat-Viewer)
## Getting started
### 📦 Download the Core ML Model Only
```bash
pip install huggingface-hub
huggingface-cli download --include sharp.mlpackage/ --local-dir . pearsonkyle/Sharp-coreml
```
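The same download can be scripted with the `huggingface_hub` Python API, equivalent to the CLI call above:
```python
from huggingface_hub import snapshot_download

# Fetch only the Core ML package from the model repo
snapshot_download(
    repo_id="pearsonkyle/Sharp-coreml",
    allow_patterns=["sharp.mlpackage/*"],
    local_dir=".",
)
```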
### 🧰 Clone the Full Repository
This will include the inference and model conversion/validation scripts.
```bash
# Install the Xet Git extension used by Hugging Face for large-file storage
brew install git-xet
git xet install
```
Clone the model repository:
```bash
git clone git@hf.co:pearsonkyle/Sharp-coreml
```
### 📱 Run Inference on Apple Devices
Use the provided [sharp.swift](sharp.swift) inference script to load the model and generate 3D Gaussian splats (PLY) from any image:
```bash
# Compile the Swift runner (requires Xcode command-line tools)
swiftc -O -o run_sharp sharp.swift -framework CoreML -framework CoreImage -framework AppKit
# Run inference on an image and decimate the output by 50%
./run_sharp sharp.mlpackage test.png test.ply -d 0.5
```
> Inference on an Apple M4 Max takes ~1.9 seconds.
**CLI Features:**
- Automatic model compilation and caching
- Decimation to reduce point cloud size while preserving visual fidelity
- Input is expected as a standard RGB image; conversion to [0,1] and CHW format happens inside the model
- PLY output compatible with [Splat Viewer](https://huggingface.co/spaces/pearsonkyle/Gaussian-Splat-Viewer), [MetalSplatter](https://github.com/scier/MetalSplatter), and [Three.js](https://threejs.org)
```text
Usage: run_sharp [OPTIONS] <model> <input_image> <output.ply>

SHARP Model Inference - Generate 3D Gaussian Splats from a single image

Arguments:
  model                      Path to the SHARP Core ML model (.mlpackage, .mlmodel, or .mlmodelc)
  input_image                Path to input image (PNG, JPEG, etc.)
  output.ply                 Path for output PLY file

Options:
  -m, --model PATH           Path to Core ML model
  -i, --input PATH           Path to input image
  -o, --output PATH          Path for output PLY file
  -f, --focal-length FLOAT   Focal length in pixels (default: 1536)
  -d, --decimation FLOAT     Decimation ratio 0.0-1.0 or percentage 1-100 (default: 1.0 = keep all)
                             Example: 0.5 or 50 keeps 50% of Gaussians
  -h, --help                 Show this help message
```
## Model Input and Output
### 📥 Input
The Core ML model accepts two inputs (see the Python sketch after this list):
- **`image`**: A 3-channel RGB image in `uint8` format with shape `(1, 3, H, W)`.
- Values are expected in range `[0, 255]` (no manual normalization required).
- Recommended resolution: `1536×1536` (matches training size).
- Aspect ratio is preserved; input will be resized internally if needed.
- **`disparity_factor`**: A scalar tensor of shape `(1,)` representing the ratio `focal_length / image_width`.
- Use `1.0` for standard cameras (e.g., typical smartphone or DSLR).
- Adjust slightly to control depth scale: higher values = closer objects, lower values = farther scenes.
- If using the `sharp.swift` runner, this input is automatically computed from your image dimensions.
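For scripted use on a Mac, the model can also be driven from Python via `coremltools` (prediction requires macOS). A minimal sketch, assuming the input names above and a `test.png` in the working directory:
```python
import numpy as np
import coremltools as ct
from PIL import Image

# Load (and implicitly compile) the Core ML package
model = ct.models.MLModel("sharp.mlpackage")

# Prepare the image: RGB, CHW layout, values in [0, 255] (no normalization);
# coremltools casts the array to the model's declared input dtype as needed
img = Image.open("test.png").convert("RGB").resize((1536, 1536))
arr = np.asarray(img, dtype=np.float32).transpose(2, 0, 1)[None]  # (1, 3, H, W)

# disparity_factor = focal_length / image_width; 1.0 with the defaults above
disparity = np.array([1536.0 / arr.shape[-1]], dtype=np.float32)

out = model.predict({"image": arr, "disparity_factor": disparity})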
### 📤 Output
The model outputs five tensors representing a 3D Gaussian splat representation:
| Output | Shape | Description |
|--------|-------|-------------|
| `mean_vectors_3d_positions` | `(1, N, 3)` | 3D positions in Normalized Device Coordinates (NDC) — x, y, z. |
| `singular_values_scales` | `(1, N, 3)` | Scale parameters along each principal axis (width, height, depth). |
| `quaternions_rotations` | `(1, N, 4)` | Unit quaternions `[w, x, y, z]` encoding orientation of each Gaussian. |
| `colors_rgb_linear` | `(1, N, 3)` | Linear RGB color values in range `[0, 1]` (no gamma correction). |
| `opacities_alpha_channel` | `(1, N)` | Opacity (alpha) values per Gaussian, in range `[0, 1]`. |
For the default model, the number of Gaussians `N` is 1,179,648.
> 🌍 These outputs are fully compatible with [Splat Viewer](https://huggingface.co/spaces/pearsonkyle/Gaussian-Splat-Viewer) and [MetalSplatter](https://github.com/scier/MetalSplatter).
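Continuing from the `out` dictionary in the Python example above, the five tensors can be repacked into the de-facto 3D Gaussian Splatting PLY layout. This is a hedged sketch, not the `sharp.swift` writer: the field conventions (log scales, logit opacities, SH degree-0 colors) are the common 3DGS ones and should be checked against your target viewer, and positions are written as-is without any NDC-to-camera transform the runner may apply:
```python
import numpy as np

def write_splat_ply(path, pts, scales, quats, rgb, alpha, eps=1e-6):
    """Write Gaussians in the common 3DGS binary PLY layout (assumed here):
    log scales, logit opacities, SH degree-0 color coefficients."""
    n = pts.shape[0]
    f_dc = (rgb - 0.5) / 0.28209479177387814         # invert the SH basis Y_00
    a = np.clip(alpha, eps, 1.0 - eps)
    logit_alpha = np.log(a / (1.0 - a))              # viewers apply sigmoid
    data = np.hstack([
        pts,                                         # x, y, z
        np.zeros((n, 3)),                            # nx, ny, nz (unused)
        f_dc,                                        # f_dc_0..2
        logit_alpha[:, None],                        # opacity
        np.log(np.maximum(scales, eps)),             # scale_0..2 (log space)
        quats,                                       # rot_0..3 as [w, x, y, z]
    ]).astype(np.float32)
    props = ["x", "y", "z", "nx", "ny", "nz",
             "f_dc_0", "f_dc_1", "f_dc_2", "opacity",
             "scale_0", "scale_1", "scale_2",
             "rot_0", "rot_1", "rot_2", "rot_3"]
    header = ("ply\nformat binary_little_endian 1.0\n"
              f"element vertex {n}\n"
              + "".join(f"property float {p}\n" for p in props)
              + "end_header\n")
    with open(path, "wb") as f:
        f.write(header.encode("ascii"))
        f.write(data.tobytes())

write_splat_ply(
    "test.ply",
    out["mean_vectors_3d_positions"][0],
    out["singular_values_scales"][0],
    out["quaternions_rotations"][0],
    out["colors_rgb_linear"][0],
    out["opacities_alpha_channel"][0],
)
```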
### 🔍 Model Validation Results
The Core ML model has been rigorously validated against the original PyTorch implementation. Below are the numerical accuracy metrics across all 5 output tensors:
| Output | Max Diff | Mean Diff | P99 Diff | Angular Diff (°): Max / Mean / P99 | Status |
|--------|----------|-----------|----------|------------------|--------|
| Mean Vectors (3D Positions) | 0.000794 | 0.000049 | 0.000094 | - | ✅ PASS |
| Singular Values (Scales) | 0.000035 | 0.000000 | 0.000002 | - | ✅ PASS |
| Quaternions (Rotations) | 1.425558 | 0.000024 | 0.000067 | 9.2519 / 0.0019 / 0.0396 | ✅ PASS |
| Colors (RGB Linear) | 0.001440 | 0.000005 | 0.000055 | - | ✅ PASS |
| Opacities (Alpha) | 0.004183 | 0.000005 | 0.000114 | - | ✅ PASS |
> **Validation Notes:**
> - All outputs match PyTorch to within 0.01% mean error.
> - Quaternion angular errors are below 1° for 99% of Gaussians.
> - The large elementwise max diff for quaternions reflects sign flips: `q` and `-q` encode the same rotation, which the angular metric accounts for.
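For reference, statistics like those in the table can be computed as below. This is a sketch of the methodology, not the repo's actual validation script; the quaternion comparison takes `|dot|` so that `q` and `-q` count as the same rotation:
```python
import numpy as np

def diff_stats(a, b):
    """Max / mean / 99th-percentile absolute difference between two tensors."""
    d = np.abs(np.asarray(a, np.float64) - np.asarray(b, np.float64)).ravel()
    return d.max(), d.mean(), np.percentile(d, 99)

def quat_angle_deg(q1, q2):
    """Per-Gaussian rotation error in degrees, sign-ambiguity aware."""
    q1 = q1 / np.linalg.norm(q1, axis=-1, keepdims=True)
    q2 = q2 / np.linalg.norm(q2, axis=-1, keepdims=True)
    dot = np.clip(np.abs(np.sum(q1 * q2, axis=-1)), 0.0, 1.0)
    return np.degrees(2.0 * np.arccos(dot))
```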
## Reproducing the Conversion
To reproduce the conversion from PyTorch to Core ML, follow these steps:
```bash
git clone https://github.com/apple/ml-sharp.git
cd ml-sharp
conda create -n sharp python=3.13
conda activate sharp
pip install -r requirements.txt
pip install coremltools
cd ../
python convert.py  # conversion script shipped in this model repo
```
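For orientation, the core of such a conversion typically looks like the sketch below. This is not the repo's `convert.py`: `DummySharp` is a hypothetical stand-in so the snippet runs end to end, and the real script loads the pretrained SHARP network and may declare dtypes and output names differently:
```python
import torch
import coremltools as ct

class DummySharp(torch.nn.Module):
    """Placeholder with SHARP's I/O signature; swap in the real network."""
    def forward(self, image, disparity_factor):
        n = 16  # the real model emits ~1.18M Gaussians
        bias = image.mean() + disparity_factor.sum()  # keep both inputs traced
        return (bias + torch.zeros(1, n, 3),  # positions
                torch.ones(1, n, 3),          # scales
                torch.ones(1, n, 4),          # quaternions
                torch.ones(1, n, 3),          # colors
                torch.ones(1, n))             # opacities

model = DummySharp().eval()
image = torch.zeros(1, 3, 1536, 1536)
disparity = torch.ones(1)
traced = torch.jit.trace(model, (image, disparity))

mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[
        ct.TensorType(name="image", shape=image.shape),
        ct.TensorType(name="disparity_factor", shape=disparity.shape),
    ],
)
mlmodel.save("sharp.mlpackage")
```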
## Citation
If you find this work useful, please cite the original paper:
```bibtex
@article{Sharp2025:arxiv,
  title   = {Sharp Monocular View Synthesis in Less Than a Second},
  author  = {Lars Mescheder and Wei Dong and Shiwei Li and Xuyang Bai and Marcel Santos and Peiyun Hu and Bruno Lecouat and Mingmin Zhen and Ama\"{e}l Delaunoy and Tian Fang and Yanghai Tsin and Stephan R. Richter and Vladlen Koltun},
  journal = {arXiv preprint arXiv:2512.10685},
  year    = {2025},
  url     = {https://arxiv.org/abs/2512.10685},
}
```