SLIP: Sensor Language-Informed Pretraining
Learning Transferable Sensor Models via Language-Informed Pretraining
Yuliang Chen, Arvind Pillai, Yu Yvonne Wu, Tess Z. Griffin, Lisa Marsch, Michael V. Heinz, Nicholas C. Jacobson, Andrew Campbell
Dartmouth College
[Paper] [Code] [Dataset] [SFT Dataset]
Overview
SLIP is a multimodal pretraining framework that learns language-aligned sensor representations transferable across diverse sensor setups. It integrates CLIP-style contrastive alignment with sensor-conditioned captioning, enabling both discriminative understanding and generative reasoning over multivariate time series from heterogeneous sensors.
Key features:
- FlexMLP: A weight-sharing patch embedding that dynamically adapts to different temporal resolutions and variable-length inputs without retraining
- Repurposed decoder-only LLM: Splits a pretrained Gemma-3-270M into a unimodal text encoder (first 12 layers) and a multimodal decoder (last 6 layers with cross-attention), enabling efficient sensor-conditioned text generation
- Contrastive + Captioning pretraining: Joint CLIP-style contrastive loss and autoregressive captioning loss for both discriminative and generative capabilities
- Cross-domain transfer: Pretrained on 600K+ sensor-caption pairs (~1B time points) spanning health, environment, IoT, energy, and transportation
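To make the contrastive half of the objective concrete, here is a minimal sketch of a symmetric CLIP-style InfoNCE loss over matched text/sensor embedding pairs. This is an illustrative stand-in, not the repository's implementation; the function name, batch, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb, sensor_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (text, sensor) embeddings.
    Illustrative sketch; not the released SLIP training code."""
    # L2-normalize so the dot product equals cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    sensor_emb = F.normalize(sensor_emb, dim=-1)
    # (B, B) similarity matrix; diagonal entries are the matched pairs
    logits = text_emb @ sensor_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    # Cross-entropy in both directions (text-to-sensor and sensor-to-text)
    loss_t = F.cross_entropy(logits, targets)
    loss_s = F.cross_entropy(logits.t(), targets)
    return (loss_t + loss_s) / 2

# Perfectly aligned pairs drive the loss toward its minimum
emb = F.normalize(torch.randn(4, 640), dim=-1)
loss = clip_contrastive_loss(emb, emb)
```

The captioning half of the objective is a standard autoregressive next-token loss on the decoder, conditioned on the pooled sensor tokens via cross-attention.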
Architecture
SLIP comprises four components:
- Sensor Encoder (120M params): Transformer with FlexMLP patch embedding and 2D RoPE for cross-sensor and long-range temporal interactions
- Sensor Pooler: Attention pooling with 65 learnable queries (1 CLS + 64 caption tokens) compressing variable-length sensor tokens to fixed-size representations
- Text Encoder: First 12 layers of Gemma-3-270M (last 4 layers unfrozen during pretraining)
- Multimodal Decoder: Last 6 layers of Gemma-3-270M extended with cross-attention for sensor-conditioned generation
Total: ~220M parameters, 67M trainable.
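The Sensor Pooler's compression step can be sketched as cross-attention from a fixed bank of learnable queries. The sketch below is illustrative rather than the released implementation; only the 65-query count and 640-d width mirror the card, while the class name, head count, and initialization are assumptions.

```python
import torch
import torch.nn as nn

class AttentionPooler(nn.Module):
    """Compress a variable-length token sequence into a fixed number of
    outputs via cross-attention from learnable queries.
    Illustrative sketch: 65 queries = 1 CLS + 64 caption tokens."""
    def __init__(self, dim=640, num_queries=65, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens, key_padding_mask=None):
        # tokens: (B, L, dim), where L varies with the sensor input
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens,
                              key_padding_mask=key_padding_mask)
        return pooled  # (B, 65, dim) regardless of L

pooler = AttentionPooler()
out = pooler(torch.randn(2, 37, 640))  # any sequence length works
```

Because the query bank is fixed, the decoder always receives the same number of sensor tokens no matter how long or how many-channeled the input series is.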
Results
| Task | Metric | Score |
|---|---|---|
| Linear Probing (11 datasets avg.) | Accuracy | 77.14% |
| Sensor-based QA | Accuracy | 64.83% |
| Sensor Captioning | BERTScore | 0.887 |
The linear-probing accuracy represents a 5.93% relative improvement over baseline models, averaged across the 11 datasets.
Checkpoints
| File | Description |
|---|---|
| `model.safetensors` | Pretrained SLIP base model |
| `har.safetensors` | SFT for HAR chain-of-thought QA |
| `sleep.safetensors` | SFT for sleep-stage chain-of-thought QA |
| `ecg.safetensors` | SFT for ECG-QA chain-of-thought QA |
| `tsqa.safetensors` | SFT for time-series QA |
| `caption.safetensors` | SFT for M4 sensor captioning |
Installation
```bash
conda create -n slip python=3.10 -y && conda activate slip
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirement.txt
```
Download checkpoints:
```python
from huggingface_hub import hf_hub_download

hf_hub_download("LeoChen085/SLIP", "SLIP_gemma270.pth", local_dir="ckpt")

# Optional: task-specific SFT checkpoints
for name in ["har", "sleep", "ecg", "tsqa", "caption"]:
    hf_hub_download("LeoChen085/SLIP", f"{name}.safetensors", local_dir="ckpt")
```
Quick Start
Load Model
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("LeoChen085/SLIP", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model.eval()
```
Get Contrastive Embeddings
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Build sensor input (flexi-patch format)
batch_size, num_vars, num_patches, patch_size = 2, 3, 10, 16
sensor_ids, sensor_masks, sensor_times = [], [], []
for _ in range(batch_size):
    vars_x, vars_m, vars_t = [], [], []
    for _ in range(num_vars):
        vars_x.append(torch.randn(num_patches, patch_size, device=device))
        vars_m.append(torch.ones(num_patches, patch_size, device=device))
        vars_t.append(
            torch.linspace(0, 1, num_patches, device=device)
            .unsqueeze(-1).expand(num_patches, patch_size)
        )
    sensor_ids.append(vars_x)
    sensor_masks.append(vars_m)
    sensor_times.append(vars_t)

sensors = {
    "input_ids": sensor_ids,
    "attention_mask": sensor_masks,
    "time_index": sensor_times,
}

queries = ["Describe the pattern of this sensor data.", "What activity is this?"]
tok = tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=64)
text = {k: v.to(device) for k, v in tok.items()}

with torch.no_grad():
    text_emb, sensor_emb = model.get_embedding(text, sensors)

# text_emb / sensor_emb shape: (batch_size, 640)
sim = torch.nn.functional.cosine_similarity(text_emb, sensor_emb)
print(f"Cosine similarity: {sim.tolist()}")
```
Generate Text Conditioned on Sensor Data
```python
prompt = "This sensor reading indicates"
gen_tok = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True)
gen_text = {k: v.to(device) for k, v in gen_tok.items()}

with torch.no_grad():
    output_ids = model.generate(gen_text, sensors, max_new_tokens=50)

for i, ids in enumerate(output_ids):
    print(f"Sample {i}: {tokenizer.decode(ids, skip_special_tokens=True)}")
```
Get Sensor-Only Embeddings (No Text Needed)
```python
with torch.no_grad():
    sensor_emb = model.get_sensor_embedding(
        input_ids=sensors["input_ids"],
        mask=sensors["attention_mask"],
        time_index=sensors["time_index"],
    )
# sensor_emb shape: (batch_size, 640)
```
Load Task-Specific SFT Checkpoint
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

har_path = hf_hub_download("LeoChen085/SLIP", "har.safetensors")
result = model.load_state_dict(load_file(har_path, device=str(device)), strict=False)
print(f"Loaded HAR checkpoint: missing {len(result.missing_keys)}, unexpected {len(result.unexpected_keys)}")
```
SFT Inference: Question Answering over Sensor Data
The SFT checkpoints enable natural-language Q&A directly on sensor signals. Each sample pairs a multivariate time series with a formatted prompt; the model generates a chain-of-thought reasoning trace followed by the final answer.
Input format (from the SFT dataset):
```
[sensor description / context]
Question: <question about the sensor data>
Answer:
```

The model continues from `Answer:` and produces the full response.
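Assuming the format above, assembling a prompt by hand might look like the following. The helper `build_qa_prompt` is a hypothetical illustration; in practice the released `SFTCollator` constructs the prompt for you.

```python
def build_qa_prompt(context: str, question: str) -> str:
    """Assemble a prompt in the SFT input format shown above.
    Illustrative helper; the real collator in util.dataset may differ."""
    return f"{context}\nQuestion: {question}\nAnswer:"

prompt = build_qa_prompt(
    "Tri-axial accelerometer, 50 Hz, 2.56 s window.",
    "What activity is the wearer performing?",
)
print(prompt)
```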
End-to-end inference example (using har_cot as an example task):
```python
import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from torch.utils.data import DataLoader
from util.dataset import SftDataset, SFTCollator

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load base model and tokenizer
model = AutoModel.from_pretrained("LeoChen085/SLIP", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model.eval().to(device)

# 2. Swap in the HAR SFT checkpoint
har_path = hf_hub_download("LeoChen085/SLIP", "har.safetensors")
model.load_state_dict(load_file(har_path, device=str(device)), strict=False)

# 3. Load SFT test data (auto-downloaded from HuggingFace)
test_set = SftDataset("har_cot", split="test", hf_repo="LeoChen085/SlipSFTDataset")
# is_test=True feeds only the prompt; the answer is held out for evaluation
loader = DataLoader(test_set, batch_size=8,
                    collate_fn=SFTCollator(tokenizer, max_len=2880, is_test=True))
batch = next(iter(loader))
sensor = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch["sensor"].items()}
text = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch["text"].items()}

# 4. Generate the answer
with torch.no_grad():
    output_ids = model.generate(text, sensor, max_new_tokens=200)

# Strip the prompt from the output, keeping only the newly generated tokens
prompts = tokenizer.batch_decode(text["input_ids"], skip_special_tokens=True)
answers = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
ground_truths = text["labels"]  # list of strings when is_test=True

idx = 3
answer_only = answers[idx][len(prompts[idx]):].strip()
print("=== Model answer ===")
print(answer_only)
# The accelerometer data over the 2.56 second window shows relatively low
# variability and consistent patterns across the X, Y, and Z axes. The lack of
# large, rapid changes in acceleration across all axes suggests minimal physical
# activity, consistent with a stationary position. Answer: sitting.

print("\n=== Ground truth ===")
print(ground_truths[idx])
# The sustained low variability following the initial adjustment is characteristic
# of a sedentary behavior. Answer: sitting.
```
Available SFT tasks and their checkpoints:
| Task | Checkpoint | Description |
|---|---|---|
| `har_cot` | `har.safetensors` | Human activity recognition with chain-of-thought (walking, running, cycling, …) |
| `sleep_cot` | `sleep.safetensors` | Sleep stage classification with CoT (Wake, N1, N2, N3, REM) |
| `ecg_cot` | `ecg.safetensors` | ECG morphology QA with CoT (normal/abnormal, rhythm, intervals) |
| `tsqa` | `tsqa.safetensors` | General time-series multiple-choice QA |
| `m4_caption` | `caption.safetensors` | Free-form natural-language captioning of M4 sensor traces |
Replace `"har_cot"` / `"har.safetensors"` with any row from the table above to switch tasks.
Evaluation Datasets
The 11 evaluation datasets span four domains:
| Domain | Datasets |
|---|---|
| Activity Recognition | WISDM, UCI-HAR |
| Clinical Diagnosis | Stroke (PPG_CVA), Diabetes (PPG_DM), Hypertension (PPG_HTN), Sleep Stage (sleepEDF), Heart Condition (ptbxl) |
| Stress Prediction | WESAD, StudentLife |
| Urban Sensing | AsphaltObstacles, Beijing AQI |
Citation
```bibtex
@article{chen2026slip,
  title={Learning Transferable Sensor Models via Language-Informed Pretraining},
  author={Chen, Yuliang and Pillai, Arvind and Wu, Yu Yvonne and Griffin, Tess Z. and Marsch, Lisa and Heinz, Michael V. and Jacobson, Nicholas C. and Campbell, Andrew},
  journal={Preprint},
  year={2026}
}
```
License
This project is licensed under the MIT License.