---
language:
- en
license: mit
tags:
- slip
- time-series
- sensor
- multimodal
- contrastive-learning
- custom_code
base_model:
- google/gemma-3-270m
datasets:
- LeoChen085/SlipDataset
- LeoChen085/SlipSFTDataset
pipeline_tag: feature-extraction
---
# SLIP: Sensor Language-Informed Pretraining
**Learning Transferable Sensor Models via Language-Informed Pretraining**
Yuliang Chen, Arvind Pillai, Yu Yvonne Wu, Tess Z. Griffin, Lisa Marsch, Michael V. Heinz, Nicholas C. Jacobson, Andrew Campbell
*Dartmouth College*
[[Paper]](asset/manuscript.pdf) [[Code]](https://github.com/yuc0805/SLIP) [[Dataset]](https://huggingface.co/datasets/LeoChen085/SlipDataset) [[SFT Dataset]](https://huggingface.co/datasets/LeoChen085/SlipSFTDataset)
---
## Overview
SLIP is a multimodal pretraining framework that learns language-aligned sensor representations transferable across diverse sensor setups. It integrates CLIP-style contrastive alignment with sensor-conditioned captioning, enabling both discriminative understanding and generative reasoning over multivariate time series from heterogeneous sensors.
**Key features:**
- **FlexMLP**: A weight-sharing patch embedding that dynamically adapts to different temporal resolutions and variable-length inputs without retraining
- **Repurposed decoder-only LLM**: Splits a pretrained Gemma-3-270M into a unimodal text encoder (first 12 layers) and a multimodal decoder (last 6 layers with cross-attention), enabling efficient sensor-conditioned text generation
- **Contrastive + Captioning pretraining**: Joint CLIP-style contrastive loss and autoregressive captioning loss for both discriminative and generative capabilities
- **Cross-domain transfer**: Pretrained on 600K+ sensor-caption pairs (~1B time points) spanning health, environment, IoT, energy, and transportation
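The encoder/decoder split of the repurposed LLM can be sketched with a toy layer stack. The 12/6 split matches the card; `ToyBlock` and the slicing are illustrative stand-ins, not the repository's actual code:

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one Gemma transformer layer (illustrative only)."""
    def __init__(self, dim=640):
        super().__init__()
        self.ffn = nn.Linear(dim, dim)
    def forward(self, x):
        return x + self.ffn(x)

# A decoder-only LM is a stack of identical blocks; SLIP repurposes the
# first 12 as a unimodal text encoder and the last 6 (with cross-attention
# added separately) as a multimodal decoder.
blocks = nn.ModuleList(ToyBlock() for _ in range(18))
text_encoder = blocks[:12]   # unimodal text encoder
mm_decoder = blocks[12:]     # multimodal decoder
print(len(text_encoder), len(mm_decoder))  # 12 6
```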
## Architecture
SLIP comprises four components:
1. **Sensor Encoder** (120M params): Transformer with FlexMLP patch embedding and 2D RoPE for cross-sensor and long-range temporal interactions
2. **Sensor Pooler**: Attention pooling with 65 learnable queries (1 CLS + 64 caption tokens) compressing variable-length sensor tokens to fixed-size representations
3. **Text Encoder**: First 12 layers of Gemma-3-270M (last 4 layers unfrozen during pretraining)
4. **Multimodal Decoder**: Last 6 layers of Gemma-3-270M extended with cross-attention for sensor-conditioned generation
**Total: ~220M parameters, 67M trainable.**
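The pooler's query-based compression can be sketched as follows. Dimensions follow the card (65 queries, 640-dim embeddings); the module itself is a simplified stand-in for SLIP's actual pooler:

```python
import torch
import torch.nn as nn

class QueryPooler(nn.Module):
    """Compress a variable-length token sequence to a fixed 65-slot output
    via cross-attention from learnable queries (simplified sketch)."""
    def __init__(self, dim=640, num_queries=65, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                    # tokens: (B, L, dim), any L
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # queries attend over tokens
        return pooled                             # (B, 65, dim)

pooler = QueryPooler()
out = pooler(torch.randn(2, 137, 640))  # arbitrary sequence length
print(out.shape)  # torch.Size([2, 65, 640])
```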
## Results
| Task | Metric | Score |
|------|--------|-------|
| Linear Probing (11 datasets avg.) | Accuracy | 77.14% |
| Sensor-based QA | Accuracy | 64.83% |
| Sensor Captioning | BERTScore | 0.887 |
Linear probing accuracy represents a **5.93% relative improvement** over baselines across 11 diverse datasets.
## Checkpoints
| File | Description |
|------|-------------|
| `model.safetensors` | Pretrained SLIP base model |
| `har.safetensors` | SFT for HAR chain-of-thought QA |
| `sleep.safetensors` | SFT for Sleep stage chain-of-thought QA |
| `ecg.safetensors` | SFT for ECG-QA chain-of-thought QA |
| `tsqa.safetensors` | SFT for time series QA |
| `caption.safetensors` | SFT for M4 sensor captioning |
## Installation
```bash
conda create -n slip python=3.10 -y && conda activate slip
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirement.txt
```
Download checkpoints:
```python
from huggingface_hub import hf_hub_download
hf_hub_download("LeoChen085/SLIP", "SLIP_gemma270.pth", local_dir="ckpt")
# Optional: task-specific SFT checkpoints
for name in ["har", "sleep", "ecg", "tsqa", "caption"]:
    hf_hub_download("LeoChen085/SLIP", f"{name}.safetensors", local_dir="ckpt")
```
## Quick Start
### Load Model
```python
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("LeoChen085/SLIP", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model.eval()
```
### Get Contrastive Embeddings
```python
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Build sensor input (flexi-patch format)
batch_size, num_vars, num_patches, patch_size = 2, 3, 10, 16
sensor_ids, sensor_masks, sensor_times = [], [], []
for _ in range(batch_size):
    vars_x, vars_m, vars_t = [], [], []
    for _ in range(num_vars):
        vars_x.append(torch.randn(num_patches, patch_size, device=device))
        vars_m.append(torch.ones(num_patches, patch_size, device=device))
        vars_t.append(
            torch.linspace(0, 1, num_patches, device=device)
            .unsqueeze(-1).expand(num_patches, patch_size)
        )
    sensor_ids.append(vars_x)
    sensor_masks.append(vars_m)
    sensor_times.append(vars_t)
sensors = {
"input_ids": sensor_ids,
"attention_mask": sensor_masks,
"time_index": sensor_times,
}
queries = ["Describe the pattern of this sensor data.", "What activity is this?"]
tok = tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=64)
text = {k: v.to(device) for k, v in tok.items()}
with torch.no_grad():
    text_emb, sensor_emb = model.get_embedding(text, sensors)
# text_emb / sensor_emb shape: (batch_size, 640)
sim = torch.nn.functional.cosine_similarity(text_emb, sensor_emb)
print(f"Cosine similarity: {sim.tolist()}")
```
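Because text and sensor embeddings share one space, the contrastive embeddings above also support zero-shot classification: score one sensor clip against several candidate label prompts and take a softmax. A sketch with random stand-in vectors (in practice, use `get_embedding` outputs; the 0.07 temperature is a typical CLIP-style choice, not a value from the paper):

```python
import torch
import torch.nn.functional as F

# Stand-ins for model outputs: one sensor embedding, K candidate label embeddings
sensor_emb = torch.randn(640)
label_embs = torch.randn(4, 640)  # e.g. "walking", "running", "sitting", "cycling"

# Cosine-similarity logits over candidate labels, then softmax
sims = F.cosine_similarity(sensor_emb.unsqueeze(0), label_embs, dim=-1)
probs = F.softmax(sims / 0.07, dim=-1)  # temperature is an assumed value
pred = probs.argmax().item()
print(probs.sum().item())  # ~1.0
```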
### Generate Text Conditioned on Sensor Data
```python
prompt = "This sensor reading indicates"
gen_tok = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True)
gen_text = {k: v.to(device) for k, v in gen_tok.items()}
with torch.no_grad():
    output_ids = model.generate(gen_text, sensors, max_new_tokens=50)
for i, ids in enumerate(output_ids):
    print(f"Sample {i}: {tokenizer.decode(ids, skip_special_tokens=True)}")
```
### Get Sensor-Only Embeddings (No Text Needed)
```python
with torch.no_grad():
    sensor_emb = model.get_sensor_embedding(
        input_ids=sensors["input_ids"],
        mask=sensors["attention_mask"],
        time_index=sensors["time_index"],
    )
# sensor_emb shape: (batch_size, 640)
```
### Load Task-Specific SFT Checkpoint
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
har_path = hf_hub_download("LeoChen085/SLIP", "har.safetensors")
result = model.load_state_dict(load_file(har_path, device=str(device)), strict=False)
print(f"Loaded HAR checkpoint — missing: {len(result.missing_keys)}, unexpected: {len(result.unexpected_keys)}")
```
### SFT Inference: Question Answering over Sensor Data
The SFT checkpoints enable natural-language Q&A directly on sensor signals. Each sample pairs a multivariate time series with a formatted prompt; the model generates a chain-of-thought reasoning trace followed by the final answer.
**Input format** (from the SFT dataset):
```
[sensor description / context]
Question: <question about the sensor data>
Answer:
```
The model continues from `Answer:` and produces the full response.
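A prompt in this layout can be assembled with a small helper (the context and question strings below are illustrative; the trailing `Answer:` is what the model continues from):

```python
def build_prompt(context: str, question: str) -> str:
    """Format a sensor-QA prompt in the SFT input layout."""
    return f"{context}\nQuestion: {question}\nAnswer:"

prompt = build_prompt(
    "3-axis accelerometer, 50 Hz, 2.56 s window.",  # illustrative context
    "What activity is the wearer performing?",
)
print(prompt)
```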
**End-to-end inference example** (using `har_cot` as an example task):
```python
import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from torch.utils.data import DataLoader
from util.dataset import SftDataset, SFTCollator
device = "cuda" if torch.cuda.is_available() else "cpu"
# 1. Load base model and tokenizer
model = AutoModel.from_pretrained("LeoChen085/SLIP", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model.eval().to(device)
# 2. Swap in the HAR SFT checkpoint
har_path = hf_hub_download("LeoChen085/SLIP", "har.safetensors")
model.load_state_dict(load_file(har_path, device=str(device)), strict=False)
# 3. Load SFT test data (auto-downloaded from HuggingFace)
test_set = SftDataset("har_cot", split="test", hf_repo="LeoChen085/SlipSFTDataset")
# is_test=True feeds only the prompt; answer is held out for evaluation
loader = DataLoader(test_set, batch_size=8,
                    collate_fn=SFTCollator(tokenizer, max_len=2880, is_test=True))
batch = next(iter(loader))
sensor = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch["sensor"].items()}
text = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch["text"].items()}
# 4. Generate the answer
with torch.no_grad():
    output_ids = model.generate(text, sensor, max_new_tokens=200)
# Strip the prompt from the output — keep only the newly generated tokens
prompts = tokenizer.batch_decode(text["input_ids"], skip_special_tokens=True)
answers = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
ground_truths = text["labels"] # list of strings when is_test=True
idx = 3
answer_only = answers[idx][len(prompts[idx]):].strip()
print("=== Model answer ===")
print(answer_only)
# The accelerometer data over the 2.56 second window shows relatively low
# variability and consistent patterns across the X, Y, and Z axes. The lack of
# large, rapid changes in acceleration across all axes suggests minimal physical
# activity, consistent with a stationary position. Answer: sitting.
print("\n=== Ground truth ===")
print(ground_truths[idx])
# The sustained low variability following the initial adjustment is characteristic
# of a sedentary behavior. Answer: sitting.
```
**Available SFT tasks and their checkpoints:**
| Task | Checkpoint | Description |
|------|-----------|-------------|
| `har_cot` | `har.safetensors` | Human activity recognition with chain-of-thought (walking, running, cycling, …) |
| `sleep_cot` | `sleep.safetensors` | Sleep stage classification with CoT (Wake, N1, N2, N3, REM) |
| `ecg_cot` | `ecg.safetensors` | ECG morphology QA with CoT (normal/abnormal, rhythm, intervals) |
| `tsqa` | `tsqa.safetensors` | General time-series multiple-choice QA |
| `m4_caption` | `caption.safetensors` | Free-form natural-language captioning of M4 sensor traces |
Replace `"har_cot"` / `"har.safetensors"` with any row from the table above to switch tasks.
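Task switching can also be done programmatically with a small mapping that mirrors the table. The `load_sft` helper is illustrative, not part of the repository (imports are kept local so the mapping is usable stand-alone):

```python
# Task name -> SFT checkpoint, mirroring the table above
SFT_CHECKPOINTS = {
    "har_cot": "har.safetensors",
    "sleep_cot": "sleep.safetensors",
    "ecg_cot": "ecg.safetensors",
    "tsqa": "tsqa.safetensors",
    "m4_caption": "caption.safetensors",
}

def load_sft(model, task: str, device: str = "cpu"):
    """Download the task's SFT weights and merge them into the base model."""
    from huggingface_hub import hf_hub_download
    from safetensors.torch import load_file
    path = hf_hub_download("LeoChen085/SLIP", SFT_CHECKPOINTS[task])
    return model.load_state_dict(load_file(path, device=device), strict=False)
```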
## Evaluation Datasets
The 11 evaluation datasets span four domains:
| Domain | Datasets |
|--------|----------|
| Activity Recognition | WISDM, UCI-HAR |
| Clinical Diagnosis | Stroke (PPG_CVA), Diabetes (PPG_DM), Hypertension (PPG_HTN), Sleep Stage (sleepEDF), Heart Condition (ptbxl) |
| Stress Prediction | WESAD, StudentLife |
| Urban Sensing | AsphaltObstacles, Beijing AQI |
## Citation
```bibtex
@article{chen2026slip,
title={Learning Transferable Sensor Models via Language-Informed Pretraining},
author={Chen, Yuliang and Pillai, Arvind and Wu, Yu Yvonne and Griffin, Tess Z. and Marsch, Lisa and Heinz, Michael V. and Jacobson, Nicholas C. and Campbell, Andrew},
journal={Preprint},
year={2026}
}
```
## License
This project is licensed under the [MIT License](LICENSE).