xianghuix (xiexh20) committed
Commit 8772e96 · 1 Parent(s): feec897

Upload README.md (#1)

Co-authored-by: Xianghui Xie <xiexh20@users.noreply.huggingface.co>

Files changed (1): README.md (+127, -9)
README.md CHANGED
Removed front matter:

```yaml
---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-to-3d
tags:
- human-object-interaction
- robotics
---
```
<p align="center"><b>Model Card - CARI4D</b></p>

# Overview

## Description:
The CoCoNet model, part of the CARI4D method, refines initial human and object pose parameters produced by human and object pose estimation foundation models. It additionally predicts binary contact labels for downstream applications. The model is transformer-based and agnostic to the specific object category.
This model is for research and development only.

### License/Terms of Use:
Governing Terms: NVIDIA License. Additional Information: https://github.com/facebookresearch/dinov2/blob/main/LICENSE.

### Deployment Geography:
Global

### Use Case:
Researchers and developers in computer vision, VR/AR, and robotics, in particular those building intelligent humanoid robots, are expected to use this method for tasks such as 4D reconstruction, interaction data collection, and humanoid robot learning.

### Release Date:
**GitHub:** 02/28/2026 via https://github.com/NVlabs/CARI4D

## Reference(s):
[CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction](https://arxiv.org/abs/2512.11988), Sec. 3.3.

## Model Architecture:
**Architecture Type:** Transformers and convolutional neural networks (CNNs).
**Network Architecture:** The network has three parts: 1) an input image encoder (DINOv2); 2) a set of blocks (CNNs and transformers) that perform matching and comparison with long-range dependencies; and 3) a set of multilayer perceptrons that predict the updated human and object poses.

**Number of model parameters:** 194M
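
As a rough illustration of the three stages above, the sketch below traces tensor shapes from encoder to matching blocks to MLP heads. All function names, feature sizes, and patch parameters here are assumptions for illustration only, not the released implementation.

```python
import numpy as np

# Placeholder sequence length, image size, and feature dim (assumptions).
T, H, W, D = 8, 224, 224, 384

def encode(frames):
    # 1) Image encoder (DINOv2-style ViT): one feature token per 14x14 patch.
    n_tokens = (H // 14) * (W // 14)
    return np.zeros((frames.shape[0], n_tokens, D), dtype=np.float32)

def match(obs_feat, render_feat):
    # 2) Matching/comparison blocks (CNNs + transformers); reduced here to a
    # feature concatenation so that the shapes stay traceable.
    return np.concatenate([obs_feat, render_feat], axis=-1)

def heads(fused):
    # 3) MLP heads predicting per-frame pose updates and hand contacts.
    t = fused.shape[0]
    return {
        "smpl_pose":   np.zeros((t, 144), dtype=np.float32),
        "smpl_shape":  np.zeros((t, 10), dtype=np.float32),
        "smpl_transl": np.zeros((t, 3), dtype=np.float32),
        "obj_rot":     np.zeros((t, 6), dtype=np.float32),
        "obj_transl":  np.zeros((t, 3), dtype=np.float32),
        "contact":     np.zeros((t, 2), dtype=np.float32),
    }

observation = np.zeros((T, 3, H, W), dtype=np.float32)  # real video frames
rendering   = np.zeros((T, 3, H, W), dtype=np.float32)  # synthetic renderings
outputs = heads(match(encode(observation), encode(rendering)))
assert outputs["smpl_pose"].shape == (T, 144)
```

The output dimensions of this stub mirror the parameter sizes listed in the Output section of this card.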

## Input:
**Input Type(s):** Two sequences of RGB images, xyz maps, and human-object masks.
**Input Format(s):** Red, Green, Blue (RGB, float), xyz map (float), and masks (binary).
**Input Parameters:** The inputs are two sequences of images consisting of RGB, xyz map, and masks; each sequence is a 4-dimensional tensor. One sequence comes from the input observation (the actual RGB video) and the other from synthetic renderings of the initial human-object estimates.

**Other Properties Related to Input:** More specifically, the inputs have the following dimensions:
- RGB: 2xTx3xHxW
- xyz: 2xTx3xHxW
- masks: 2xTx2xHxW

where T is the sequence length and H, W are the image height and width, respectively.
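
To make the layout concrete, this minimal sketch assembles dummy tensors with the shapes listed above; the sequence length and resolution are placeholder values not fixed by this card.

```python
import numpy as np

# Placeholder sequence length and resolution (assumptions, not specified here).
T, H, W = 8, 224, 224

def stream(channels):
    """One per-frame image stack: T x C x H x W."""
    return np.zeros((T, channels, H, W), dtype=np.float32)

# Stream 0: the observed RGB video; stream 1: synthetic renderings of the
# initial human-object estimate. Stacking them yields the leading "2" axis.
rgb   = np.stack([stream(3), stream(3)])   # 2 x T x 3 x H x W
xyz   = np.stack([stream(3), stream(3)])   # 2 x T x 3 x H x W
masks = np.stack([stream(2), stream(2)])   # 2 x T x 2 x H x W

assert rgb.shape == (2, T, 3, H, W)
assert masks.shape == (2, T, 2, H, W)
```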

## Output:
**Output Type(s):** Human and object pose parameters, plus contact scores for the two hands.
**Output Format:** float, float, float.
**Output Parameters:** The output includes the updated pose parameters and binary hand contacts for each frame in the input sequence. Each parameter is a 2D array (TxD).
We use the [SMPL body](https://smpl.is.tue.mpg.de/) representation for the human; the object is represented by rigid rotation and translation parameters.

**Other Properties Related to Output:** More specifically, the output parameters have the following dimensions:
- Human pose (SMPL): Tx144
- Human shape (SMPL): Tx10
- Human translation (SMPL): Tx3
- Object rotation: Tx6
- Object translation: Tx3
- Binary contact: Tx2
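
The Tx6 object rotation matches the size of the common continuous 6D rotation parameterization (and Tx144 = 24 SMPL joints x 6). Assuming that representation is used, a per-frame rotation matrix can be recovered by Gram-Schmidt orthogonalization:

```python
import numpy as np

def rot6d_to_matrix(d6):
    """Recover a 3x3 rotation matrix from a 6D rotation vector via
    Gram-Schmidt orthogonalization (Zhou et al., CVPR 2019)."""
    a1, a2 = np.asarray(d6[:3], float), np.asarray(d6[3:], float)
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3])  # rows form a right-handed orthonormal basis

# An unperturbed 6D vector decodes to the identity rotation.
R = rot6d_to_matrix([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
assert np.allclose(R, np.eye(3))
assert np.isclose(np.linalg.det(R), 1.0)
```

This is only a decoding sketch; the released code may use a different rotation convention.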

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration:
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

**Runtime Engine(s):**
* N/A

**Supported Hardware Microarchitecture Compatibility:**
NVIDIA Ampere

**Supported Operating System(s):**
* Linux

## Model Version(s):
v1.0: Initial model version with full capabilities, unpruned and trained.

## Training and Evaluation Datasets:

## Training Dataset:
**Link:** [BEHAVE](https://virtualhumans.mpi-inf.mpg.de/behave/), [HODome](https://juzezhang.github.io/NeuralDome/)

**Data Modality:**
* Image
* Video
* Other: 3D human and object meshes and pose parameters.

**Image Training Data Size:**
* 2.4 million images from the videos.

**Video Training Data Size:**
* 1,600 videos

**Non-Audio, Image, Text Training Data Size:**
* Pose parameters corresponding to the 2.4M video frames.

**Data Collection Method by dataset:**
Automatic/Sensors

**Labeling Method by dataset:**
Hybrid: Automatic/Sensors, Human

**Properties:** The datasets capture diverse human-object interaction motions using multi-view RGB or RGB-D cameras. Each image is annotated with three-dimensional human and object meshes and the corresponding pose parameters.

## Evaluation Dataset:
**Link:** [BEHAVE](https://virtualhumans.mpi-inf.mpg.de/behave/), [InterCap](https://intercap.is.tue.mpg.de/)

**Data Collection Method by dataset:**
Automatic/Sensors

**Labeling Method by dataset:**
Hybrid: Automatic/Sensors, Human

**Properties:** The datasets capture diverse human-object interaction motions using multi-view RGB or RGB-D cameras. Each image is annotated with three-dimensional human and object meshes and the corresponding pose parameters.

## Inference:
**Acceleration Engine:** TensorRT

**Test Hardware:**
* ZED stereo camera, NVIDIA RTX 4090 GPU

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities, or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).