---
language:
- en
license: mit
tags:
- slip
- time-series
- sensor
- multimodal
- contrastive-learning
- custom_code
base_model:
- google/gemma-3-270m
datasets:
- LeoChen085/SlipDataset
- LeoChen085/SlipSFTDataset
pipeline_tag: feature-extraction
---

# SLIP: Sensor Language-Informed Pretraining

**Learning Transferable Sensor Models via Language-Informed Pretraining**

Yuliang Chen, Arvind Pillai, Yu Yvonne Wu, Tess Z. Griffin, Lisa Marsch, Michael V. Heinz, Nicholas C. Jacobson, Andrew Campbell

*Dartmouth College*

[[Paper]](asset/manuscript.pdf) [[Code]](https://github.com/yuc0805/SLIP) [[Dataset]](https://huggingface.co/datasets/LeoChen085/SlipDataset) [[SFT Dataset]](https://huggingface.co/datasets/LeoChen085/SlipSFTDataset)

---

## Overview

SLIP is a multimodal pretraining framework that learns language-aligned sensor representations transferable across diverse sensor setups. It integrates CLIP-style contrastive alignment with sensor-conditioned captioning, enabling both discriminative understanding and generative reasoning over multivariate time series from heterogeneous sensors.

**Key features:**
- **FlexMLP**: A weight-sharing patch embedding that dynamically adapts to different temporal resolutions and variable-length inputs without retraining
- **Repurposed decoder-only LLM**: Splits a pretrained Gemma-3-270M into a unimodal text encoder (first 12 layers) and a multimodal decoder (last 6 layers with cross-attention), enabling efficient sensor-conditioned text generation
- **Contrastive + Captioning pretraining**: Joint CLIP-style contrastive loss and autoregressive captioning loss for both discriminative and generative capabilities
- **Cross-domain transfer**: Pretrained on 600K+ sensor-caption pairs (~1B time points) spanning health, environment, IoT, energy, and transportation
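
The joint objective combines a CLIP-style contrastive term with an autoregressive captioning term. A minimal sketch of the symmetric contrastive loss over a batch of paired embeddings (illustrative only; the temperature value and shapes are assumptions, not the exact training code):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb, sensor_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/sensor embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the loss
    pulls them together while pushing apart mismatched pairs in the batch.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    sensor_emb = F.normalize(sensor_emb, dim=-1)
    logits = text_emb @ sensor_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0))             # diagonal = positives
    loss_t2s = F.cross_entropy(logits, targets)        # text -> sensor
    loss_s2t = F.cross_entropy(logits.t(), targets)    # sensor -> text
    return (loss_t2s + loss_s2t) / 2

# Toy usage with random 640-d embeddings
loss = clip_contrastive_loss(torch.randn(4, 640), torch.randn(4, 640))
```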

## Architecture

SLIP comprises four components:

1. **Sensor Encoder** (120M params): Transformer with FlexMLP patch embedding and 2D RoPE for cross-sensor and long-range temporal interactions
2. **Sensor Pooler**: Attention pooling with 65 learnable queries (1 CLS + 64 caption tokens) compressing variable-length sensor tokens to fixed-size representations
3. **Text Encoder**: First 12 layers of Gemma-3-270M (last 4 layers unfrozen during pretraining)
4. **Multimodal Decoder**: Last 6 layers of Gemma-3-270M extended with cross-attention for sensor-conditioned generation

**Total: ~220M parameters, 67M trainable.**
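
The Sensor Pooler's role can be sketched as a small cross-attention module: a fixed set of learnable queries attends over however many sensor tokens the encoder produced, yielding a fixed-size output. This is a hypothetical illustration following the description above (query count and dimension from the component list; the actual implementation may differ):

```python
import torch
import torch.nn as nn

class AttentionPooler(nn.Module):
    """Compress variable-length sensor tokens into a fixed set of outputs.

    Sketch: 65 learnable queries (1 CLS + 64 caption tokens) cross-attend
    to the encoder's token sequence, so the output shape is independent of
    the input length.
    """
    def __init__(self, dim=640, num_queries=65, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens, key_padding_mask=None):
        # tokens: (B, L, dim) with L varying across batches
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens,
                              key_padding_mask=key_padding_mask)
        return pooled  # (B, 65, dim) regardless of input length L

pooler = AttentionPooler()
out = pooler(torch.randn(2, 37, 640))  # 37 input tokens -> 65 pooled outputs
```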

## Results

| Task | Metric | Score |
|------|--------|-------|
| Linear Probing (11 datasets avg.) | Accuracy | 77.14% |
| Sensor-based QA | Accuracy | 64.83% |
| Sensor Captioning | BERTScore | 0.887 |

Linear probing accuracy represents a **5.93% relative improvement** over baselines across 11 diverse datasets.

## Checkpoints

| File | Description |
|------|-------------|
| `model.safetensors` | Pretrained SLIP base model |
| `har.safetensors` | SFT for HAR chain-of-thought QA |
| `sleep.safetensors` | SFT for Sleep stage chain-of-thought QA |
| `ecg.safetensors` | SFT for ECG-QA chain-of-thought QA |
| `tsqa.safetensors` | SFT for time series QA |
| `caption.safetensors` | SFT for M4 sensor captioning |

## Installation

```bash
conda create -n slip python=3.10 -y && conda activate slip
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirement.txt
```

Download checkpoints:

```python
from huggingface_hub import hf_hub_download

hf_hub_download("LeoChen085/SLIP", "SLIP_gemma270.pth", local_dir="ckpt")

# Optional: task-specific SFT checkpoints
for name in ["har", "sleep", "ecg", "tsqa", "caption"]:
    hf_hub_download("LeoChen085/SLIP", f"{name}.safetensors", local_dir="ckpt")
```

## Quick Start

### Load Model

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("LeoChen085/SLIP", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model.eval()
```

### Get Contrastive Embeddings

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Build sensor input (flexi-patch format)
batch_size, num_vars, num_patches, patch_size = 2, 3, 10, 16
sensor_ids, sensor_masks, sensor_times = [], [], []
for _ in range(batch_size):
    vars_x, vars_m, vars_t = [], [], []
    for _ in range(num_vars):
        vars_x.append(torch.randn(num_patches, patch_size, device=device))
        vars_m.append(torch.ones(num_patches, patch_size, device=device))
        vars_t.append(
            torch.linspace(0, 1, num_patches, device=device)
            .unsqueeze(-1).expand(num_patches, patch_size)
        )
    sensor_ids.append(vars_x)
    sensor_masks.append(vars_m)
    sensor_times.append(vars_t)

sensors = {
    "input_ids": sensor_ids,
    "attention_mask": sensor_masks,
    "time_index": sensor_times,
}

queries = ["Describe the pattern of this sensor data.", "What activity is this?"]
tok = tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=64)
text = {k: v.to(device) for k, v in tok.items()}

with torch.no_grad():
    text_emb, sensor_emb = model.get_embedding(text, sensors)

# text_emb / sensor_emb shape: (batch_size, 640)
sim = torch.nn.functional.cosine_similarity(text_emb, sensor_emb)
print(f"Cosine similarity: {sim.tolist()}")
```

### Generate Text Conditioned on Sensor Data

```python
prompt = "This sensor reading indicates"
gen_tok = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True)
gen_text = {k: v.to(device) for k, v in gen_tok.items()}

with torch.no_grad():
    output_ids = model.generate(gen_text, sensors, max_new_tokens=50)

for i, ids in enumerate(output_ids):
    print(f"Sample {i}: {tokenizer.decode(ids, skip_special_tokens=True)}")
```

### Get Sensor-Only Embeddings (No Text Needed)

```python
with torch.no_grad():
    sensor_emb = model.get_sensor_embedding(
        input_ids=sensors["input_ids"],
        mask=sensors["attention_mask"],
        time_index=sensors["time_index"],
    )
# sensor_emb shape: (batch_size, 640)
```
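
These frozen embeddings are what the linear-probing results above evaluate: fit a linear classifier on top of the fixed representations. A minimal sketch with scikit-learn, using random arrays as stand-ins for real embeddings and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for (N, 640) frozen sensor embeddings and integer class labels;
# in practice these come from model.get_sensor_embedding over a dataset.
rng = np.random.default_rng(0)
train_emb, train_y = rng.normal(size=(100, 640)), rng.integers(0, 3, 100)
test_emb, test_y = rng.normal(size=(40, 640)), rng.integers(0, 3, 40)

probe = LogisticRegression(max_iter=1000)
probe.fit(train_emb, train_y)
acc = probe.score(test_emb, test_y)  # linear-probe accuracy
```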

### Load Task-Specific SFT Checkpoint

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

har_path = hf_hub_download("LeoChen085/SLIP", "har.safetensors")
result = model.load_state_dict(load_file(har_path, device=str(device)), strict=False)
print(f"Loaded HAR checkpoint — missing: {len(result.missing_keys)}, unexpected: {len(result.unexpected_keys)}")
```

### SFT Inference: Question Answering over Sensor Data

The SFT checkpoints enable natural-language Q&A directly on sensor signals. Each sample pairs a multivariate time series with a formatted prompt; the model generates a chain-of-thought reasoning trace followed by the final answer.

**Input format** (from the SFT dataset):
```
[sensor description / context]
Question: <question about the sensor data>
Answer:
```
The model continues from `Answer:` and produces the full response.
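
Assuming that format, assembling a prompt is a one-liner (hypothetical helper; the `SFTCollator` below handles this for dataset samples, so this is only for hand-rolled queries):

```python
def build_prompt(context: str, question: str) -> str:
    """Assemble an SFT-style prompt; the model continues after 'Answer:'."""
    return f"{context}\nQuestion: {question}\nAnswer:"

prompt = build_prompt(
    "Tri-axial accelerometer, 50 Hz, 2.56 s window.",
    "What activity is the wearer performing?",
)
```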

**End-to-end inference example** (using `har_cot` as an example task):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from torch.utils.data import DataLoader
from util.dataset import SftDataset, SFTCollator

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load base model and tokenizer
model = AutoModel.from_pretrained("LeoChen085/SLIP", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model.eval().to(device)

# 2. Swap in the HAR SFT checkpoint
har_path = hf_hub_download("LeoChen085/SLIP", "har.safetensors")
model.load_state_dict(load_file(har_path, device=str(device)), strict=False)

# 3. Load SFT test data (auto-downloaded from HuggingFace)
test_set = SftDataset("har_cot", split="test", hf_repo="LeoChen085/SlipSFTDataset")
# is_test=True feeds only the prompt; answer is held out for evaluation
loader = DataLoader(test_set, batch_size=8,
                    collate_fn=SFTCollator(tokenizer, max_len=2880, is_test=True))

batch = next(iter(loader))
sensor = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch["sensor"].items()}
text   = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch["text"].items()}

# 4. Generate the answer
with torch.no_grad():
    output_ids = model.generate(text, sensor, max_new_tokens=200)

# Strip the prompt from the output — keep only the newly generated tokens
prompts      = tokenizer.batch_decode(text["input_ids"], skip_special_tokens=True)
answers      = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
ground_truths = text["labels"]   # list of strings when is_test=True

idx = 3
answer_only = answers[idx][len(prompts[idx]):].strip()

print("=== Model answer ===")
print(answer_only)
# The accelerometer data over the 2.56 second window shows relatively low
# variability and consistent patterns across the X, Y, and Z axes. The lack of
# large, rapid changes in acceleration across all axes suggests minimal physical
# activity, consistent with a stationary position. Answer: sitting.

print("\n=== Ground truth ===")
print(ground_truths[idx])
# The sustained low variability following the initial adjustment is characteristic
# of a sedentary behavior. Answer: sitting.
```

**Available SFT tasks and their checkpoints:**

| Task | Checkpoint | Description |
|------|-----------|-------------|
| `har_cot` | `har.safetensors` | Human activity recognition with chain-of-thought (walking, running, cycling, …) |
| `sleep_cot` | `sleep.safetensors` | Sleep stage classification with CoT (Wake, N1, N2, N3, REM) |
| `ecg_cot` | `ecg.safetensors` | ECG morphology QA with CoT (normal/abnormal, rhythm, intervals) |
| `tsqa` | `tsqa.safetensors` | General time-series multiple-choice QA |
| `m4_caption` | `caption.safetensors` | Free-form natural-language captioning of M4 sensor traces |

Replace `"har_cot"` / `"har.safetensors"` with any row from the table above to switch tasks.

## Evaluation Datasets

The 11 evaluation datasets span four domains:

| Domain | Datasets |
|--------|----------|
| Activity Recognition | WISDM, UCI-HAR |
| Clinical Diagnosis | Stroke (PPG_CVA), Diabetes (PPG_DM), Hypertension (PPG_HTN), Sleep Stage (sleepEDF), Heart Condition (ptbxl) |
| Stress Prediction | WESAD, StudentLife |
| Urban Sensing | AsphaltObstacles, Beijing AQI |

## Citation

```bibtex
@article{chen2026slip,
  title={Learning Transferable Sensor Models via Language-Informed Pretraining},
  author={Chen, Yuliang and Pillai, Arvind and Wu, Yu Yvonne and Griffin, Tess Z. and Marsch, Lisa and Heinz, Michael V. and Jacobson, Nicholas C. and Campbell, Andrew},
  journal={Preprint},
  year={2026}
}
```

## License

This project is licensed under the [MIT License](LICENSE).