carohiguera committed
Commit 4823360 · 1 Parent(s): 3e0a8f9

Adding model ckpt and readme

Files changed (2)
  1. README.md +1 -1
  2. assets/{2.arch.png → arch.png} +0 -0
README.md CHANGED
@@ -15,7 +15,7 @@ Disclaimer: This model card was written by the Sparsh-X authors. The Transformer
  ## Model description
  Sparsh-X is a transformer-based backbone where each input signal is first processed independently for $L_f$ layers through self-attention. Thereafter, we allow cross-modal information flow via attention bottlenecks. Specifically, we concatenate $B$ bottleneck fusion tokens to each modality’s embedding for the subsequent $L_b$ blocks. After each cross-modal update, the fusion tokens are averaged across modalities to promote information sharing. Intuitively, the bottleneck tokens act as multimodal summarizers, distilling and exchanging information between tactile modalities within each transformer block.
 
- ![](assets/2.arch.png)
+ ![](assets/arch.png)
 
  The inputs to Sparsh-X are the image, audio, accelerometer, and pressure signals recorded by the Digit 360 sensor. Tactile images are sampled at 30fps and passed to the model with a temporal stride of 5, concatenated along the channel dimension. We crop to zoom in on the fish-eye image and resize to 224 × 224 × 3. Image patches (16 × 16) are then tokenized into embeddings of 768 dimensions through a linear projection layer. Audio comes from two contact microphones sampled at 48kHz. A 0.55s window of the audio signal is converted into a log-mel spectrogram with 128 channels, computed from a 5ms Hamming window with a hop length of 2.5ms. We concatenate the spectrograms from both microphones, resulting in an audio input of 224 × 256, which is further tokenized with a patch size of 16. IMU data from the 3-axis accelerometer is sampled at 400Hz and aggregated over a 0.55s window. The pressure signal is sampled at 200Hz and aggregated over a 1.1s window. Both signals are tokenized, resulting in 224 × 3 and 224 × 1 temporal inputs.
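For readers of the diff above, here is a minimal PyTorch sketch of the attention-bottleneck fusion step the model description outlines: each modality runs self-attention over its own tokens plus $B$ shared bottleneck tokens, and the updated bottlenecks are then averaged across modalities. The class name, the use of `nn.TransformerEncoderLayer`, and all shapes are illustrative assumptions, not the released Sparsh-X code.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One cross-modal block (hypothetical sketch, not the released code):
    each modality attends over [its own tokens; B shared bottleneck tokens],
    then the updated bottlenecks are averaged across modalities."""

    def __init__(self, dim: int, num_heads: int, num_modalities: int):
        super().__init__()
        # One standard transformer encoder layer per modality stream.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
            for _ in range(num_modalities)
        )

    def forward(self, tokens: list[torch.Tensor], fusion: torch.Tensor):
        # tokens[m]: (batch, N_m, dim) embeddings for modality m
        # fusion:    (batch, B, dim) shared bottleneck tokens
        num_bottleneck = fusion.shape[1]
        new_tokens, per_modality_fusion = [], []
        for x, block in zip(tokens, self.blocks):
            y = block(torch.cat([x, fusion], dim=1))   # joint self-attention
            new_tokens.append(y[:, :-num_bottleneck])  # updated modality tokens
            per_modality_fusion.append(y[:, -num_bottleneck:])
        # Averaging the bottleneck tokens shares information across modalities.
        fusion = torch.stack(per_modality_fusion).mean(dim=0)
        return new_tokens, fusion
```

Stacking $L_f$ plain per-modality layers followed by $L_b$ of these fusion layers reproduces the two-stage scheme the card describes.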
assets/{2.arch.png → arch.png} RENAMED
File without changes
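The audio numbers in the card (two 48kHz contact microphones, a 0.55s window, 128 mel channels, a 5ms Hamming window, a 2.5ms hop) can be approximated with torchaudio as below. The `n_fft` padding and the exact framing that yields 224 time frames are not specified in the card, so treat this as a sketch under those assumptions.

```python
import torch
import torchaudio

SR = 48_000                 # contact-mic sample rate (from the card)
WIN = int(0.005 * SR)       # 5ms Hamming window -> 240 samples
HOP = int(0.0025 * SR)      # 2.5ms hop -> 120 samples

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR,
    n_fft=512,              # assumed: window zero-padded to 512 for 128 mel bins
    win_length=WIN,
    hop_length=HOP,
    n_mels=128,
    window_fn=torch.hamming_window,
)

def audio_to_input(mic_a: torch.Tensor, mic_b: torch.Tensor) -> torch.Tensor:
    """mic_a, mic_b: (26400,) waveforms, i.e. 0.55s at 48kHz.
    Returns roughly (224, 256): ~224 time frames x (2 mics * 128 mel channels)."""
    spec = torch.cat(
        [torch.log(mel(m) + 1e-6) for m in (mic_a, mic_b)], dim=0
    )                       # (256, ~221) stacked log-mel features
    return spec.T           # time-major, ready for 16 x 16 patch tokenization
```

With the default centered framing this gives 221 frames for a 0.55s clip; the card's stated 224 presumably comes from padding choices in the original pipeline.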