Update model card with paper link and sample usage snippets
Hi! I'm Niels from the Hugging Face community team.
This PR improves the model card for DeFM by:
- Replacing the placeholder `TODO` arXiv link with the correct link to the paper: [DeFM: Learning Foundation Representations from Depth for Robotics](https://huggingface.co/papers/2601.18923).
- Adding a detailed **Usage** section with code snippets for model loading, preprocessing, and inference, based on the documentation in your GitHub repository.
These changes help users get started with the model directly from the Hub. Please let me know if you have any questions!
README.md
CHANGED

````diff
@@ -2,13 +2,14 @@
 license: apache-2.0
 pipeline_tag: image-feature-extraction
 ---
+
 # DeFM: Learning Foundation Representations from Depth for Robotics
 
 <div align="center">
 
 [](https://opensource.org/licenses/Apache-2.0)
 [](https://github.com/leggedrobotics/defm)
-
+[](https://arxiv.org/abs/2601.18923)
 [](https://de-fm.github.io/)
 </div>
 
@@ -25,9 +26,40 @@ TL;DR - A DINO-style encoder, but for depth image inputs.
 - **Compact efficient models**: We distill our DeFM-ViT-L into a family of smaller efficient CNNs as small as 3M params for robot policy learning.
 - **Robotics Proven**: Our encoder is proven effective for diverse robotic tasks such as navigation, manipulation and locomotion without task-specific fine-tuning.
 
-## Usage
+## 🚀 Usage
+
+### 1. Loading the Model
+Load via **TorchHub** for easy integration:
+
+```python
+import torch
+
+# Load the 307M Parameter Foundation Model
+model = torch.hub.load('leggedrobotics/defm:main', 'defm_vit_l14', pretrained=True)
+model.eval().to("cuda")
+```
+
+### 2. Preprocessing
+DeFM requires depth maps to be processed into our metric-aware 3-channel format.
+
+```python
+from defm import preprocess_depth_image
+
+# Depth needs to be in meters (numpy array, tensor or PIL image)
+normalized_depth = preprocess_depth_image(metric_depth, target_size=518, patch_size=14)
+```
+
+### 3. Inference
+```python
+with torch.no_grad():
+    output = model.get_intermediate_layers(
+        normalized_depth, n=1, reshape=True, return_class_token=True)
+
+spatial_tokens = output[0][0]  # (B, C, H', W')
+class_token = output[0][1]     # (B, C)
+```
 
-
+For more details, visit our [GitHub repository](https://github.com/leggedrobotics/defm).
 
 ## 📊 Model Zoo
 
````
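As a reviewer aid: the diff leaves `preprocess_depth_image` opaque. The sketch below is a purely hypothetical NumPy stand-in showing what a metric-depth-to-3-channel transform of this shape typically does (clip and scale metric depth, resize to a patch-aligned square, replicate to three channels). The function name, the `max_depth` constant, and the nearest-neighbour resize are assumptions for illustration, not the actual `defm` implementation:

```python
import numpy as np

def sketch_preprocess_depth(depth_m, target_size=518, patch_size=14, max_depth=10.0):
    """Hypothetical stand-in for a depth preprocessing step.

    depth_m: (H, W) array of metric depth in meters.
    Returns a (1, 3, S, S) float32 array, with S a multiple of patch_size.
    """
    # Round the target size down to a multiple of the ViT patch size.
    size = (target_size // patch_size) * patch_size

    # Replace NaNs, clip far readings, and scale to [0, 1].
    d = np.clip(np.nan_to_num(depth_m, nan=0.0), 0.0, max_depth) / max_depth

    # Nearest-neighbour resize to a square (placeholder for real interpolation).
    ys = (np.arange(size) * d.shape[0] / size).astype(int)
    xs = (np.arange(size) * d.shape[1] / size).astype(int)
    d = d[np.ix_(ys, xs)]

    # Replicate the single depth channel to the 3 channels a ViT backbone expects.
    return np.repeat(d[None, None], 3, axis=1).astype(np.float32)

out = sketch_preprocess_depth(np.random.rand(480, 640).astype(np.float32) * 5.0)
print(out.shape)  # (1, 3, 518, 518)
```

Note that 518 is already a multiple of 14 (37 patches per side), which is presumably why the snippet in the diff passes `target_size=518, patch_size=14`.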