---
language:
- en
license: mit
tags:
- slip
- time-series
- sensor
- multimodal
- contrastive-learning
- custom_code
base_model:
- google/gemma-3-270m
datasets:
- LeoChen085/SlipDataset
- LeoChen085/SlipSFTDataset
pipeline_tag: feature-extraction
---

# SLIP: Sensor Language-Informed Pretraining

**Learning Transferable Sensor Models via Language-Informed Pretraining**

Yuliang Chen, Arvind Pillai, Yu Yvonne Wu, Tess Z. Griffin, Lisa Marsch, Michael V. Heinz, Nicholas C. Jacobson, Andrew Campbell

*Dartmouth College*

[[Paper]](asset/manuscript.pdf) [[Code]](https://github.com/LeoChen085/SLIP) [[Dataset]](https://huggingface.co/datasets/LeoChen085/SlipDataset) [[SFT Dataset]](https://huggingface.co/datasets/LeoChen085/SlipSFTDataset)

---

## Overview

SLIP is a multimodal pretraining framework that learns language-aligned sensor representations transferable across diverse sensor setups. It combines CLIP-style contrastive alignment with sensor-conditioned captioning, enabling both discriminative understanding and generative reasoning over multivariate time series from heterogeneous sensors.

**Key features:**
- **FlexMLP**: A weight-sharing patch embedding that dynamically adapts to different temporal resolutions and variable-length inputs without retraining
- **Repurposed decoder-only LLM**: Splits a pretrained Gemma-3-270M into a unimodal text encoder (first 12 layers) and a multimodal decoder (last 6 layers with cross-attention), enabling efficient sensor-conditioned text generation
- **Contrastive + captioning pretraining**: A joint CLIP-style contrastive loss and autoregressive captioning loss provides both discriminative and generative capabilities
- **Cross-domain transfer**: Pretrained on 600K+ sensor-caption pairs (~1B time points) spanning health, environment, IoT, energy, and transportation
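
The joint objective can be sketched in a few lines; the following is an illustrative toy example with random stand-in tensors, not SLIP's actual training code (the temperature handling and loss weighting here are assumptions):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, dim, vocab, seq = 4, 640, 100, 8

# Stand-ins for the real encoder outputs (text encoder vs. sensor encoder + pooler)
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)
sensor_emb = F.normalize(torch.randn(batch, dim), dim=-1)
temperature = 0.07  # fixed here for simplicity; often a learnable parameter

# CLIP-style symmetric contrastive loss: matched pairs lie on the diagonal
logits = text_emb @ sensor_emb.t() / temperature
targets = torch.arange(batch)
contrastive = (F.cross_entropy(logits, targets) +
               F.cross_entropy(logits.t(), targets)) / 2

# Autoregressive captioning loss: next-token cross-entropy on decoder logits
decoder_logits = torch.randn(batch, seq, vocab)      # stand-in for decoder output
caption_ids = torch.randint(0, vocab, (batch, seq))  # stand-in for caption tokens
captioning = F.cross_entropy(decoder_logits[:, :-1].reshape(-1, vocab),
                             caption_ids[:, 1:].reshape(-1))

loss = contrastive + captioning  # joint pretraining objective
```

The contrastive term shapes the shared embedding space for retrieval and linear probing, while the captioning term trains the decoder to generate text conditioned on sensor features.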

## Architecture

SLIP comprises four components:

1. **Sensor Encoder** (120M params): Transformer with FlexMLP patch embedding and 2D RoPE for cross-sensor and long-range temporal interactions
2. **Sensor Pooler**: Attention pooling with 65 learnable queries (1 CLS + 64 caption tokens) that compresses variable-length sensor tokens into fixed-size representations
3. **Text Encoder**: First 12 layers of Gemma-3-270M (last 4 layers unfrozen during pretraining)
4. **Multimodal Decoder**: Last 6 layers of Gemma-3-270M, extended with cross-attention for sensor-conditioned generation

**Total: ~220M parameters, 67M trainable.**
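
The pooler's role can be illustrated with a generic attention-pooling sketch: a fixed bank of 65 learnable queries cross-attends over however many sensor tokens the encoder emits and always returns a fixed-size output. This is an illustrative re-implementation under stated assumptions, not SLIP's actual module:

```python
import torch
import torch.nn as nn

class AttentionPooler(nn.Module):
    """Compress a variable-length token sequence down to a fixed set of queries."""
    def __init__(self, dim: int, num_queries: int = 65, num_heads: int = 8):
        super().__init__()
        # 1 CLS-style query + 64 caption queries, learned during training
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) with arbitrary seq_len
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)  # queries attend over sensor tokens
        return out                             # (batch, 65, dim) for any seq_len

pooler = AttentionPooler(dim=640)
out_short = pooler(torch.randn(2, 37, 640))   # 37 sensor tokens
out_long = pooler(torch.randn(2, 512, 640))   # 512 sensor tokens
```

Because the query count is fixed, downstream components (the contrastive head and the decoder's cross-attention) see the same shape regardless of how many patches the sensor input produced.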

## Results

| Task | Metric | Score |
|------|--------|-------|
| Linear Probing (11 datasets avg.) | Accuracy | 77.14% |
| Sensor-based QA | Accuracy | 64.83% |
| Sensor Captioning | BERTScore | 0.887 |

Linear probing accuracy represents a **5.93% relative improvement** over baselines across 11 diverse datasets.
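
Assuming "relative improvement" follows the standard definition, (SLIP − baseline) / baseline, the implied baseline average can be backed out from the two reported numbers (an illustration of the definition only; the paper reports per-dataset baselines):

```python
# Relative improvement = (slip - baseline) / baseline
# => baseline = slip / (1 + relative improvement)
slip_acc = 77.14
rel_improvement = 0.0593
baseline_acc = slip_acc / (1 + rel_improvement)
print(f"Implied baseline average: {baseline_acc:.2f}%")  # ~72.82%
```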

## Checkpoints

| File | Description |
|------|-------------|
| `model.safetensors` | Pretrained SLIP base model |
| `har.safetensors` | SFT for HAR chain-of-thought QA |
| `sleep.safetensors` | SFT for sleep-stage chain-of-thought QA |
| `ecg.safetensors` | SFT for ECG-QA chain-of-thought QA |
| `tsqa.safetensors` | SFT for time-series QA |
| `caption.safetensors` | SFT for M4 sensor captioning |

## Installation

```bash
conda create -n slip python=3.10 -y && conda activate slip
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirement.txt
```

Download checkpoints:

```python
from huggingface_hub import hf_hub_download

hf_hub_download("LeoChen085/SLIP", "SLIP_gemma270.pth", local_dir="ckpt")

# Optional: task-specific SFT checkpoints
for name in ["har", "sleep", "ecg", "tsqa", "caption"]:
    hf_hub_download("LeoChen085/SLIP", f"{name}.safetensors", local_dir="ckpt")
```

## Quick Start

### Load Model

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("LeoChen085/SLIP", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model.eval()
```

### Get Contrastive Embeddings

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Build sensor input (flexi-patch format)
batch_size, num_vars, num_patches, patch_size = 2, 3, 10, 16
sensor_ids, sensor_masks, sensor_times = [], [], []
for _ in range(batch_size):
    vars_x, vars_m, vars_t = [], [], []
    for _ in range(num_vars):
        vars_x.append(torch.randn(num_patches, patch_size, device=device))
        vars_m.append(torch.ones(num_patches, patch_size, device=device))
        vars_t.append(
            torch.linspace(0, 1, num_patches, device=device)
            .unsqueeze(-1).expand(num_patches, patch_size)
        )
    sensor_ids.append(vars_x)
    sensor_masks.append(vars_m)
    sensor_times.append(vars_t)

sensors = {
    "input_ids": sensor_ids,
    "attention_mask": sensor_masks,
    "time_index": sensor_times,
}

queries = ["Describe the pattern of this sensor data.", "What activity is this?"]
tok = tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=64)
text = {k: v.to(device) for k, v in tok.items()}

with torch.no_grad():
    text_emb, sensor_emb = model.get_embedding(text, sensors)

# text_emb / sensor_emb shape: (batch_size, 640)
sim = torch.nn.functional.cosine_similarity(text_emb, sensor_emb)
print(f"Cosine similarity: {sim.tolist()}")
```
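
Beyond pairwise similarity, the embeddings support zero-shot retrieval: score every sensor clip against every candidate description and take the best match. A minimal sketch with random stand-in embeddings (in practice, substitute the `model.get_embedding` outputs from above):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for L2-normalized embedding outputs
text_emb = F.normalize(torch.randn(5, 640), dim=-1)    # 5 candidate descriptions
sensor_emb = F.normalize(torch.randn(2, 640), dim=-1)  # 2 sensor clips

# (num_sensors, num_texts) cosine-similarity matrix
sim_matrix = sensor_emb @ text_emb.t()
best = sim_matrix.argmax(dim=-1)  # index of best-matching description per clip
print(best.tolist())
```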

### Generate Text Conditioned on Sensor Data

```python
prompt = "This sensor reading indicates"
gen_tok = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True)
gen_text = {k: v.to(device) for k, v in gen_tok.items()}

with torch.no_grad():
    output_ids = model.generate(gen_text, sensors, max_new_tokens=50)

for i, ids in enumerate(output_ids):
    print(f"Sample {i}: {tokenizer.decode(ids, skip_special_tokens=True)}")
```

### Get Sensor-Only Embeddings (No Text Needed)

```python
with torch.no_grad():
    sensor_emb = model.get_sensor_embedding(
        input_ids=sensors["input_ids"],
        mask=sensors["attention_mask"],
        time_index=sensors["time_index"],
    )
# sensor_emb shape: (batch_size, 640)
```

### Load Task-Specific SFT Checkpoint

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

har_path = hf_hub_download("LeoChen085/SLIP", "har.safetensors")
result = model.load_state_dict(load_file(har_path, device=str(device)), strict=False)
print(f"Loaded HAR checkpoint: missing={len(result.missing_keys)}, unexpected={len(result.unexpected_keys)}")
```

### SFT Inference: Question Answering over Sensor Data

The SFT checkpoints enable natural-language Q&A directly on sensor signals. Each sample pairs a multivariate time series with a formatted prompt; the model generates a chain-of-thought reasoning trace followed by the final answer.

**Input format** (from the SFT dataset):
```
[sensor description / context]
Question: <question about the sensor data>
Answer:
```
The model continues from `Answer:` and produces the full response.
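
A small helper that assembles this prompt format might look like the following (a sketch for clarity; in practice the SFT collator handles prompt construction):

```python
def build_qa_prompt(context: str, question: str) -> str:
    """Format a sensor-QA prompt; the model continues after 'Answer:'."""
    return f"{context}\nQuestion: {question}\nAnswer:"

prompt = build_qa_prompt(
    "Tri-axial accelerometer, 50 Hz, 2.56 s window.",
    "What activity is the wearer performing?",
)
print(prompt)
```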

**End-to-end inference example** (using `har_cot` as an example task):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from torch.utils.data import DataLoader
from util.dataset import SftDataset, SFTCollator

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load base model and tokenizer
model = AutoModel.from_pretrained("LeoChen085/SLIP", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model.eval().to(device)

# 2. Swap in the HAR SFT checkpoint
har_path = hf_hub_download("LeoChen085/SLIP", "har.safetensors")
model.load_state_dict(load_file(har_path, device=str(device)), strict=False)

# 3. Load SFT test data (auto-downloaded from HuggingFace)
test_set = SftDataset("har_cot", split="test", hf_repo="LeoChen085/SlipSFTDataset")
# is_test=True feeds only the prompt; the answer is held out for evaluation
loader = DataLoader(test_set, batch_size=8,
                    collate_fn=SFTCollator(tokenizer, max_len=2880, is_test=True))

batch = next(iter(loader))
sensor = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch["sensor"].items()}
text = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch["text"].items()}

# 4. Generate the answer
with torch.no_grad():
    output_ids = model.generate(text, sensor, max_new_tokens=200)

# Strip the prompt from the output, keeping only the newly generated tokens
prompts = tokenizer.batch_decode(text["input_ids"], skip_special_tokens=True)
answers = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
ground_truths = text["labels"]  # list of strings when is_test=True

idx = 3
answer_only = answers[idx][len(prompts[idx]):].strip()

print("=== Model answer ===")
print(answer_only)
# The accelerometer data over the 2.56 second window shows relatively low
# variability and consistent patterns across the X, Y, and Z axes. The lack of
# large, rapid changes in acceleration across all axes suggests minimal physical
# activity, consistent with a stationary position. Answer: sitting.

print("\n=== Ground truth ===")
print(ground_truths[idx])
# The sustained low variability following the initial adjustment is characteristic
# of a sedentary behavior. Answer: sitting.
```
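
For classification-style tasks such as HAR, the final label can be pulled out of a chain-of-thought trace by reading the text after the last `Answer:` marker. A simple parser, assuming the output format shown above:

```python
import re

def extract_label(generation: str) -> str:
    """Return the text after the last 'Answer:' marker, lowercased and stripped."""
    matches = re.findall(r"Answer:\s*([^.\n]+)", generation)
    return matches[-1].strip().lower() if matches else ""

trace = ("The lack of large, rapid changes in acceleration across all axes "
         "suggests minimal physical activity. Answer: sitting.")
print(extract_label(trace))  # sitting
```

Comparing `extract_label` on the model output and the ground-truth string gives a quick accuracy estimate over a test loader.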

**Available SFT tasks and their checkpoints:**

| Task | Checkpoint | Description |
|------|-----------|-------------|
| `har_cot` | `har.safetensors` | Human activity recognition with chain-of-thought (walking, running, cycling, …) |
| `sleep_cot` | `sleep.safetensors` | Sleep stage classification with CoT (Wake, N1, N2, N3, REM) |
| `ecg_cot` | `ecg.safetensors` | ECG morphology QA with CoT (normal/abnormal, rhythm, intervals) |
| `tsqa` | `tsqa.safetensors` | General time-series multiple-choice QA |
| `m4_caption` | `caption.safetensors` | Free-form natural-language captioning of M4 sensor traces |

Replace `"har_cot"` / `"har.safetensors"` with any row from the table above to switch tasks.

## Evaluation Datasets

The 11 evaluation datasets span four domains:

| Domain | Datasets |
|--------|----------|
| Activity Recognition | WISDM, UCI-HAR |
| Clinical Diagnosis | Stroke (PPG_CVA), Diabetes (PPG_DM), Hypertension (PPG_HTN), Sleep Stage (sleepEDF), Heart Condition (ptbxl) |
| Stress Prediction | WESAD, StudentLife |
| Urban Sensing | AsphaltObstacles, Beijing AQI |

## Citation

```bibtex
@article{chen2026slip,
  title={Learning Transferable Sensor Models via Language-Informed Pretraining},
  author={Chen, Yuliang and Pillai, Arvind and Wu, Yu Yvonne and Griffin, Tess Z. and Marsch, Lisa and Heinz, Michael V. and Jacobson, Nicholas C. and Campbell, Andrew},
  journal={Preprint},
  year={2026}
}
```

## License

This project is licensed under the [MIT License](LICENSE).