---
language:
- en
license: mit
tags:
- slip
- time-series
- sensor
- multimodal
- contrastive-learning
- custom_code
base_model:
- google/gemma-3-270m
datasets:
- LeoChen085/SlipDataset
- LeoChen085/SlipSFTDataset
pipeline_tag: feature-extraction
---

# SLIP: Sensor Language-Informed Pretraining

**Learning Transferable Sensor Models via Language-Informed Pretraining**

Yuliang Chen, Arvind Pillai, Yu Yvonne Wu, Tess Z. Griffin, Lisa Marsch, Michael V. Heinz, Nicholas C. Jacobson, Andrew Campbell

*Dartmouth College*

[[Paper]](asset/manuscript.pdf) [[Code]](https://github.com/LeoChen085/SLIP) [[Dataset]](https://huggingface.co/datasets/LeoChen085/SlipDataset) [[SFT Dataset]](https://huggingface.co/datasets/LeoChen085/SlipSFTDataset)

---

## Overview

SLIP is a multimodal pretraining framework that learns language-aligned sensor representations transferable across diverse sensor setups. It combines CLIP-style contrastive alignment with sensor-conditioned captioning, enabling both discriminative understanding and generative reasoning over multivariate time series from heterogeneous sensors.

**Key features:**
- **FlexMLP**: A weight-sharing patch embedding that dynamically adapts to different temporal resolutions and variable-length inputs without retraining
- **Repurposed decoder-only LLM**: Splits a pretrained Gemma-3-270M into a unimodal text encoder (first 12 layers) and a multimodal decoder (last 6 layers with cross-attention), enabling efficient sensor-conditioned text generation
- **Contrastive + captioning pretraining**: A joint CLIP-style contrastive loss and autoregressive captioning loss provides both discriminative and generative capabilities
- **Cross-domain transfer**: Pretrained on 600K+ sensor-caption pairs (~1B time points) spanning health, environment, IoT, energy, and transportation
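
The joint objective can be sketched in a few lines; the following is an illustrative toy example with random stand-in tensors, not SLIP's actual training code (the temperature handling and loss weighting here are assumptions):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, dim, vocab, seq = 4, 640, 100, 8

# Stand-ins for the real encoder outputs (text encoder vs. sensor encoder + pooler)
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)
sensor_emb = F.normalize(torch.randn(batch, dim), dim=-1)
temperature = 0.07  # fixed here for simplicity; often a learnable parameter

# CLIP-style symmetric contrastive loss: matched pairs lie on the diagonal
logits = text_emb @ sensor_emb.t() / temperature
targets = torch.arange(batch)
contrastive = (F.cross_entropy(logits, targets) +
               F.cross_entropy(logits.t(), targets)) / 2

# Autoregressive captioning loss: next-token cross-entropy on decoder logits
decoder_logits = torch.randn(batch, seq, vocab)      # stand-in for decoder output
caption_ids = torch.randint(0, vocab, (batch, seq))  # stand-in for caption tokens
captioning = F.cross_entropy(decoder_logits[:, :-1].reshape(-1, vocab),
                             caption_ids[:, 1:].reshape(-1))

loss = contrastive + captioning  # joint pretraining objective
```

The contrastive term shapes the shared embedding space for retrieval and linear probing, while the captioning term trains the decoder to generate text conditioned on sensor features.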

## Architecture

SLIP comprises four components:

1. **Sensor Encoder** (120M params): Transformer with FlexMLP patch embedding and 2D RoPE for cross-sensor and long-range temporal interactions
2. **Sensor Pooler**: Attention pooling with 65 learnable queries (1 CLS + 64 caption tokens) that compresses variable-length sensor tokens into fixed-size representations
3. **Text Encoder**: First 12 layers of Gemma-3-270M (last 4 layers unfrozen during pretraining)
4. **Multimodal Decoder**: Last 6 layers of Gemma-3-270M, extended with cross-attention for sensor-conditioned generation

**Total: ~220M parameters, 67M trainable.**
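
The pooler's role can be illustrated with a generic attention-pooling sketch: a fixed bank of 65 learnable queries cross-attends over however many sensor tokens the encoder emits and always returns a fixed-size output. This is an illustrative re-implementation under stated assumptions, not SLIP's actual module:

```python
import torch
import torch.nn as nn

class AttentionPooler(nn.Module):
    """Compress a variable-length token sequence down to a fixed set of queries."""
    def __init__(self, dim: int, num_queries: int = 65, num_heads: int = 8):
        super().__init__()
        # 1 CLS-style query + 64 caption queries, learned during training
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) with arbitrary seq_len
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)  # queries attend over sensor tokens
        return out                             # (batch, 65, dim) for any seq_len

pooler = AttentionPooler(dim=640)
out_short = pooler(torch.randn(2, 37, 640))   # 37 sensor tokens
out_long = pooler(torch.randn(2, 512, 640))   # 512 sensor tokens
```

Because the query count is fixed, downstream components (the contrastive head and the decoder's cross-attention) see the same shape regardless of how many patches the sensor input produced.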

## Results

| Task | Metric | Score |
|------|--------|-------|
| Linear Probing (11 datasets avg.) | Accuracy | 77.14% |
| Sensor-based QA | Accuracy | 64.83% |
| Sensor Captioning | BERTScore | 0.887 |

Linear probing accuracy represents a **5.93% relative improvement** over baselines across 11 diverse datasets.
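
Assuming "relative improvement" follows the standard definition, (SLIP − baseline) / baseline, the implied baseline average can be backed out from the two reported numbers (an illustration of the definition only; the paper reports per-dataset baselines):

```python
# Relative improvement = (slip - baseline) / baseline
# => baseline = slip / (1 + relative improvement)
slip_acc = 77.14
rel_improvement = 0.0593
baseline_acc = slip_acc / (1 + rel_improvement)
print(f"Implied baseline average: {baseline_acc:.2f}%")  # ~72.82%
```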

## Checkpoints

| File | Description |
|------|-------------|
| `model.safetensors` | Pretrained SLIP base model |
| `har.safetensors` | SFT for HAR chain-of-thought QA |
| `sleep.safetensors` | SFT for sleep-stage chain-of-thought QA |
| `ecg.safetensors` | SFT for ECG-QA chain-of-thought QA |
| `tsqa.safetensors` | SFT for time-series QA |
| `caption.safetensors` | SFT for M4 sensor captioning |

## Installation

```bash
conda create -n slip python=3.10 -y && conda activate slip
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirement.txt
```

Download checkpoints:

```python
from huggingface_hub import hf_hub_download

hf_hub_download("LeoChen085/SLIP", "SLIP_gemma270.pth", local_dir="ckpt")

# Optional: task-specific SFT checkpoints
for name in ["har", "sleep", "ecg", "tsqa", "caption"]:
    hf_hub_download("LeoChen085/SLIP", f"{name}.safetensors", local_dir="ckpt")
```

## Quick Start

### Load Model

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("LeoChen085/SLIP", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model.eval()
```

### Get Contrastive Embeddings

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Build sensor input (flexi-patch format)
batch_size, num_vars, num_patches, patch_size = 2, 3, 10, 16
sensor_ids, sensor_masks, sensor_times = [], [], []
for _ in range(batch_size):
    vars_x, vars_m, vars_t = [], [], []
    for _ in range(num_vars):
        vars_x.append(torch.randn(num_patches, patch_size, device=device))
        vars_m.append(torch.ones(num_patches, patch_size, device=device))
        vars_t.append(
            torch.linspace(0, 1, num_patches, device=device)
            .unsqueeze(-1).expand(num_patches, patch_size)
        )
    sensor_ids.append(vars_x)
    sensor_masks.append(vars_m)
    sensor_times.append(vars_t)

sensors = {
    "input_ids": sensor_ids,
    "attention_mask": sensor_masks,
    "time_index": sensor_times,
}

queries = ["Describe the pattern of this sensor data.", "What activity is this?"]
tok = tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=64)
text = {k: v.to(device) for k, v in tok.items()}

with torch.no_grad():
    text_emb, sensor_emb = model.get_embedding(text, sensors)

# text_emb / sensor_emb shape: (batch_size, 640)
sim = torch.nn.functional.cosine_similarity(text_emb, sensor_emb)
print(f"Cosine similarity: {sim.tolist()}")
```
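
Beyond pairwise similarity, the embeddings support zero-shot retrieval: score every sensor clip against every candidate description and take the best match. A minimal sketch with random stand-in embeddings (in practice, substitute the `model.get_embedding` outputs from above):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for L2-normalized embedding outputs
text_emb = F.normalize(torch.randn(5, 640), dim=-1)    # 5 candidate descriptions
sensor_emb = F.normalize(torch.randn(2, 640), dim=-1)  # 2 sensor clips

# (num_sensors, num_texts) cosine-similarity matrix
sim_matrix = sensor_emb @ text_emb.t()
best = sim_matrix.argmax(dim=-1)  # index of best-matching description per clip
print(best.tolist())
```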

### Generate Text Conditioned on Sensor Data

```python
prompt = "This sensor reading indicates"
gen_tok = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True)
gen_text = {k: v.to(device) for k, v in gen_tok.items()}

with torch.no_grad():
    output_ids = model.generate(gen_text, sensors, max_new_tokens=50)

for i, ids in enumerate(output_ids):
    print(f"Sample {i}: {tokenizer.decode(ids, skip_special_tokens=True)}")
```

### Get Sensor-Only Embeddings (No Text Needed)

```python
with torch.no_grad():
    sensor_emb = model.get_sensor_embedding(
        input_ids=sensors["input_ids"],
        mask=sensors["attention_mask"],
        time_index=sensors["time_index"],
    )
# sensor_emb shape: (batch_size, 640)
```

### Load Task-Specific SFT Checkpoint

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

har_path = hf_hub_download("LeoChen085/SLIP", "har.safetensors")
result = model.load_state_dict(load_file(har_path, device=str(device)), strict=False)
print(f"Loaded HAR checkpoint: missing={len(result.missing_keys)}, unexpected={len(result.unexpected_keys)}")
```

### SFT Inference: Question Answering over Sensor Data

The SFT checkpoints enable natural-language Q&A directly on sensor signals. Each sample pairs a multivariate time series with a formatted prompt; the model generates a chain-of-thought reasoning trace followed by the final answer.

**Input format** (from the SFT dataset):
```
[sensor description / context]
Question: <question about the sensor data>
Answer:
```
The model continues from `Answer:` and produces the full response.
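
A small helper that assembles this prompt format might look like the following (a sketch for clarity; in practice the SFT collator handles prompt construction):

```python
def build_qa_prompt(context: str, question: str) -> str:
    """Format a sensor-QA prompt; the model continues after 'Answer:'."""
    return f"{context}\nQuestion: {question}\nAnswer:"

prompt = build_qa_prompt(
    "Tri-axial accelerometer, 50 Hz, 2.56 s window.",
    "What activity is the wearer performing?",
)
print(prompt)
```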

**End-to-end inference example** (using `har_cot` as an example task):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from torch.utils.data import DataLoader
from util.dataset import SftDataset, SFTCollator

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load base model and tokenizer
model = AutoModel.from_pretrained("LeoChen085/SLIP", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model.eval().to(device)

# 2. Swap in the HAR SFT checkpoint
har_path = hf_hub_download("LeoChen085/SLIP", "har.safetensors")
model.load_state_dict(load_file(har_path, device=str(device)), strict=False)

# 3. Load SFT test data (auto-downloaded from HuggingFace)
test_set = SftDataset("har_cot", split="test", hf_repo="LeoChen085/SlipSFTDataset")
# is_test=True feeds only the prompt; the answer is held out for evaluation
loader = DataLoader(test_set, batch_size=8,
                    collate_fn=SFTCollator(tokenizer, max_len=2880, is_test=True))

batch = next(iter(loader))
sensor = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch["sensor"].items()}
text = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch["text"].items()}

# 4. Generate the answer
with torch.no_grad():
    output_ids = model.generate(text, sensor, max_new_tokens=200)

# Strip the prompt from the output, keeping only the newly generated tokens
prompts = tokenizer.batch_decode(text["input_ids"], skip_special_tokens=True)
answers = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
ground_truths = text["labels"]  # list of strings when is_test=True

idx = 3
answer_only = answers[idx][len(prompts[idx]):].strip()

print("=== Model answer ===")
print(answer_only)
# The accelerometer data over the 2.56 second window shows relatively low
# variability and consistent patterns across the X, Y, and Z axes. The lack of
# large, rapid changes in acceleration across all axes suggests minimal physical
# activity, consistent with a stationary position. Answer: sitting.

print("\n=== Ground truth ===")
print(ground_truths[idx])
# The sustained low variability following the initial adjustment is characteristic
# of a sedentary behavior. Answer: sitting.
```
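
For classification-style tasks such as HAR, the final label can be pulled out of a chain-of-thought trace by reading the text after the last `Answer:` marker. A simple parser, assuming the output format shown above:

```python
import re

def extract_label(generation: str) -> str:
    """Return the text after the last 'Answer:' marker, lowercased and stripped."""
    matches = re.findall(r"Answer:\s*([^.\n]+)", generation)
    return matches[-1].strip().lower() if matches else ""

trace = ("The lack of large, rapid changes in acceleration across all axes "
         "suggests minimal physical activity. Answer: sitting.")
print(extract_label(trace))  # sitting
```

Comparing `extract_label` on the model output and the ground-truth string gives a quick accuracy estimate over a test loader.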

**Available SFT tasks and their checkpoints:**

| Task | Checkpoint | Description |
|------|-----------|-------------|
| `har_cot` | `har.safetensors` | Human activity recognition with chain-of-thought (walking, running, cycling, …) |
| `sleep_cot` | `sleep.safetensors` | Sleep stage classification with CoT (Wake, N1, N2, N3, REM) |
| `ecg_cot` | `ecg.safetensors` | ECG morphology QA with CoT (normal/abnormal, rhythm, intervals) |
| `tsqa` | `tsqa.safetensors` | General time-series multiple-choice QA |
| `m4_caption` | `caption.safetensors` | Free-form natural-language captioning of M4 sensor traces |

Replace `"har_cot"` / `"har.safetensors"` with any row from the table above to switch tasks.

## Evaluation Datasets

The 11 evaluation datasets span four domains:

| Domain | Datasets |
|--------|----------|
| Activity Recognition | WISDM, UCI-HAR |
| Clinical Diagnosis | Stroke (PPG_CVA), Diabetes (PPG_DM), Hypertension (PPG_HTN), Sleep Stage (sleepEDF), Heart Condition (ptbxl) |
| Stress Prediction | WESAD, StudentLife |
| Urban Sensing | AsphaltObstacles, Beijing AQI |

## Citation

```bibtex
@article{chen2026slip,
  title={Learning Transferable Sensor Models via Language-Informed Pretraining},
  author={Chen, Yuliang and Pillai, Arvind and Wu, Yu Yvonne and Griffin, Tess Z. and Marsch, Lisa and Heinz, Michael V. and Jacobson, Nicholas C. and Campbell, Andrew},
  journal={Preprint},
  year={2026}
}
```

## License

This project is licensed under the [MIT License](LICENSE).