---
license: apache-2.0
datasets:
- GuiminHu/HapticCap
- GuiminHu/VibRate
language:
- en
metrics:
- bleu
- meteor
- rouge
base_model:
- meta-llama/Llama-3.1-8B
tags:
- code
---
# 📌 HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
arXiv: https://arxiv.org/pdf/2508.06475
Code: https://github.com/LeMei/HapticLLaMA
---
## 📖 Introduction
**HapticLLaMA** is a multimodal sensory language model that translates vibration signals into natural-language descriptions within a given sensory, emotional, or associative category.
HapticLLaMA is trained in two stages: (1) supervised fine-tuning of the LLaMA architecture with LoRA-based adaptation, and (2) further fine-tuning via reinforcement
learning from human feedback (RLHF).
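To make the LoRA-based adaptation concrete, here is a minimal numpy sketch of the low-rank update rule (illustrative only, not the training code): the frozen weight `W` is augmented with a trainable low-rank product `B @ A` scaled by `alpha / r`, so only `r * (d_in + d_out)` parameters are trained instead of `d_in * d_out`.

```python
import numpy as np

# Illustrative LoRA sketch; dimensions and scaling are example values.
d_in, d_out, r, alpha = 64, 64, 8, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection (zero-initialized)

def lora_forward(x):
    # Base path plus scaled low-rank path: W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the adapted output equals the base output,
# so fine-tuning starts exactly from the pretrained model.
assert np.allclose(lora_forward(x), W @ x)
```

Because `B` starts at zero, the adapter is a no-op at initialization and gradients only flow through the small `A` and `B` matrices during fine-tuning.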
---
## 🧩 Tasks
Given a vibration signal S and a target category c ∈ {sensory, emotional, associative}, where sensory refers to physical attributes (e.g., the intensity of tapping), emotional denotes affective
impressions (e.g., the mood of a scene), and associative indicates familiar real-world experiences (e.g., the buzzing of a bee, a heartbeat), the goal is to generate a caption corresponding to the specified category of haptic experience.
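As a purely illustrative input/output example of the task (the captions below are invented for illustration, not model outputs):

```python
# Hypothetical example of category-conditioned haptic captioning.
# The signal path matches the dataset naming; the captions are invented.
signal = "./F211_loop.wav"
captions = {
    "sensory":     "a rapid, light tapping that gradually intensifies",
    "emotional":   "an urgent, slightly anxious pulse",
    "associative": "like the buzzing of a bee near your hand",
}

def describe(signal_path, category):
    # One vibration signal yields a different caption per target category.
    assert category in {"sensory", "emotional", "associative"}
    return captions[category]

print(describe(signal, "sensory"))
```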
---
## 📂 Training
HapticLLaMA training consists of (1) supervised fine-tuning with LoRA adaptation and (2) subsequent fine-tuning based on human feedback on the generated captions.
<img width="925" height="557" alt="image" src="https://github.com/user-attachments/assets/28a0aa75-d011-4870-b9ec-b9b3607eb8d8" />
---
## 📂 Haptic Tokenizer
- **Frequency-based Tokenizer**:
<img width="361" height="211" alt="image" src="https://github.com/user-attachments/assets/ca848d0b-18d5-4ad5-89e4-268399aad801" />
The frequency-based tokenizer divides the frequency range into logarithmically spaced bins that correspond to just-noticeable differences in human frequency perception. Similarly, the amplitude range is segmented into normalized levels. The tokenizer then assigns a unique
token (e.g., FREQ_3_AMP_2) to each frequency-amplitude pair, encoding the signal's spectral content in a form interpretable by LLMs.
```python
import librosa
import numpy as np

def steps_binning(frequencies, amplitudes, freq_bins=10, amp_levels=5):
    # Logarithmically spaced frequency edges (ratio 1.2 between bins)
    freq_max = np.max(frequencies)
    freq_min = freq_max / (1.2 ** (freq_bins - 1))
    freq_edges = np.geomspace(freq_min, freq_max, num=freq_bins)
    freq_labels = [f"FREQ_{i+1}" for i in range(freq_bins)]
    # Normalize amplitudes to [0, 1], then build amplitude levels
    amp_min, amp_max = np.min(amplitudes), np.max(amplitudes)
    if amp_min == amp_max:
        amplitudes = np.zeros_like(amplitudes)
        amp_edges = np.linspace(0, 1, amp_levels + 1)
    else:
        amplitudes = (amplitudes - amp_min) / (amp_max - amp_min)
        # Log-spaced levels over the normalized amplitude range
        amp_edges = np.geomspace(1.0 / (1.2 ** (amp_levels - 1)), 1.0, num=amp_levels)
    amp_labels = [f"AMP_{i+1}" for i in range(amp_levels)]
    tokens = []
    for f, a in zip(frequencies, amplitudes):
        freq_bin = min(np.digitize(f, freq_edges) - 1, freq_bins - 1)
        amp_bin = min(np.digitize(a, amp_edges) - 1, amp_levels - 1)
        tokens.append(f"{freq_labels[freq_bin]}_{amp_labels[amp_bin]}")
    return tokens

### Load a .wav file and compute its spectrogram
wav_file = r'./F211_loop.wav'    # example vibration signal
n_fft, hop_length = 1024, 512    # example STFT parameters
freq_bins, amp_levels = 10, 5
y, sr = librosa.load(wav_file, sr=None)
D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
frequencies = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
magnitudes = np.abs(D)
magnitudes = magnitudes / np.max(magnitudes)
frame_idx = 10                   # pick one STFT frame
amplitudes = magnitudes[:, frame_idx]
# Keep only frequencies below 500 Hz (the haptic range)
mask = frequencies < 500
frequencies_filtered = frequencies[mask]
amplitudes_filtered = amplitudes[mask]
### Haptic tokens from the frequency-based tokenizer
tokens = steps_binning(frequencies_filtered, amplitudes_filtered, freq_bins=freq_bins, amp_levels=amp_levels)
```
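As a quick sanity check on the binning scheme, here is a self-contained sketch of its core (assumed values, independent of the function above): `np.geomspace` produces the logarithmic frequency edges and `np.digitize` maps each value to a bin index, with out-of-range values clamped to the top bin.

```python
import numpy as np

# Self-contained sketch of the binning core used by the tokenizer.
freq_bins, amp_levels = 10, 5
freq_max = 480.0                                   # example top frequency (< 500 Hz)
freq_min = freq_max / (1.2 ** (freq_bins - 1))     # ratio 1.2 between bins
freq_edges = np.geomspace(freq_min, freq_max, num=freq_bins)
amp_edges = np.linspace(0, 1, amp_levels + 1)      # normalized amplitude levels

def to_token(f, a):
    # Clamp to the top bin so out-of-range values never index past the labels.
    fb = min(np.digitize(f, freq_edges) - 1, freq_bins - 1)
    ab = min(np.digitize(a, amp_edges) - 1, amp_levels - 1)
    return f"FREQ_{fb + 1}_AMP_{ab + 1}"

print(to_token(freq_min, 0.0))   # lowest bin in both dimensions
print(to_token(freq_max, 0.99))  # highest frequency bin
```

The clamping mirrors the `min(..., freq_bins - 1)` guard in `steps_binning`, so a frequency above the top edge still yields a valid token.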
---
- **EnCodec-based Tokenizer**:
<img width="317" height="172" alt="image" src="https://github.com/user-attachments/assets/35e50d2e-c21f-4fc1-8953-74305a752ee0" />
EnCodec is a neural audio codec that compresses audio using deep learning (Défossez et al., 2023). It consists of three
main components: (1) an encoder that transforms raw audio into a lower-dimensional latent representation, (2) a quantizer that discretizes the latent features via residual vector quantization, and (3) a decoder that reconstructs the waveform from the quantized codes. The EnCodec-based tokenizer extracts the codes produced by the residual vector quantizer in this audio compression architecture.
```python
import torch
from datasets import Dataset, Audio
from transformers import AutoProcessor, EncodecModel

encodec_model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

### EnCodec-based tokenizer
def encodec_token(wav_file):
    data_dict = {"audio": [wav_file]}
    data_dataset = Dataset.from_dict(data_dict).cast_column("audio", Audio())
    audio_sample = data_dataset[-1]["audio"]["array"]
    inputs = processor(raw_audio=audio_sample, sampling_rate=24000, return_tensors="pt")
    with torch.no_grad():
        encoded_frames = encodec_model.encode(inputs["input_values"], inputs["padding_mask"])
    # audio_codes: (n_chunks, batch, n_codebooks, n_frames); take the first codebook
    tokens = encoded_frames.audio_codes[0][0]
    tokens_list = [str(token) for token in tokens[0].tolist()]
    return tokens_list
```
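The final step of `encodec_token` is just a stringify pass over the discrete code indices. A minimal sketch with a synthetic codes tensor standing in for `encoded_frames.audio_codes` (the shape and values here are invented for illustration):

```python
import torch

# Synthetic stand-in for encoded_frames.audio_codes with shape
# (n_chunks, batch, n_codebooks, n_frames); real EnCodec codes are
# integer indices into codebooks of size 1024.
audio_codes = torch.randint(0, 1024, (1, 1, 2, 6))

# Mirror the conversion in encodec_token(): take the first chunk and
# batch element, then stringify the first codebook's code sequence.
codes = audio_codes[0][0]                      # (n_codebooks, n_frames)
tokens_list = [str(t) for t in codes[0].tolist()]
print(tokens_list)
```

Each resulting string token can then be added to the LLM vocabulary, which is what the pretrained `updated_llama_tokenizer_encodec.pt` tokenizer below provides.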
---
## 📂 Inference
Given a haptic signal, we prompt HapticLLaMA to generate captions from sensory, emotional, and associative perspectives.
<img width="448" height="329" alt="image" src="https://github.com/user-attachments/assets/2ea17083-5da3-47f2-9781-7f17912d08cc" />
```python
import os
import torch
from transformers import AutoTokenizer

# Load the HapticLLaMA checkpoint. `Model` and `args` come from the
# HapticLLaMA repository (https://github.com/LeMei/HapticLLaMA).
def load_model(stage, device, mode, model_file_url):
    if os.path.exists(model_file_url):
        model = Model(args, mode=mode)
        lora_state_dict = torch.load(model_file_url)
        missing_keys, unexpected_keys = model.load_state_dict(lora_state_dict, strict=False)
        model.to(device)
    else:
        print('invalid model url!')
        model = None
    return model

### Load the pretrained haptic tokenizers
frequency_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_steps_binning.pt/")
encodec_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_encodec.pt/")

# Formalize the input for inference
def tokenizer_haptic(haptic, prompt, mode):
    def formalize_input(haptic_tokens, tokenizer, prompt):
        tokenizer.pad_token = tokenizer.eos_token
        inputs = tokenizer(haptic_tokens, padding=True, truncation=True, return_tensors="pt")
        input_ids = inputs.input_ids
        input_atts = inputs.attention_mask
        prompt_enc = tokenizer(prompt, padding=True, truncation=True, return_tensors="pt")
        # Prepend the haptic tokens to the prompt
        prompt_ids = torch.cat((input_ids, prompt_enc.input_ids), dim=1)
        prompt_atts = torch.cat((input_atts, prompt_enc.attention_mask), dim=1)
        return input_ids, input_atts, prompt_ids, prompt_atts

    ### Frequency-based token formalization
    if mode == 'frequency':
        freq_haptic_tokens = frequency_tokenizer(haptic, mode='frequency')
        freq_haptic_tokens = [' '.join(freq_haptic_tokens)]
        return formalize_input(freq_haptic_tokens, frequency_tokenizer, prompt=prompt)
    ### EnCodec-based token formalization
    elif mode == 'encodec':
        encodec_haptic_tokens = encodec_token(haptic)
        encodec_haptic_tokens = [' '.join(encodec_haptic_tokens)]
        return formalize_input(encodec_haptic_tokens, encodec_tokenizer, prompt=prompt)
```
Inference for a single sample:
```python
haptic_signal = r'./F211_loop.wav'
sensory_prompt = 'its sensory description is'
## For emotional and associative captions:
## emotional_prompt = 'its emotional description is'
## associative_prompt = 'its associative description is'
input_ids, input_atts, prompt_ids, prompt_atts = tokenizer_haptic(haptic_signal, sensory_prompt, mode='encodec')
# encodec_model_file_url points to the downloaded HapticLLaMA checkpoint
hapticllama = load_model(stage=1, device='cuda', mode='encodec', model_file_url=encodec_model_file_url)
caption = hapticllama.generate(inputs=prompt_ids, input_atts=prompt_atts)
print(caption)
```
---
## 🚀 Citation
If you find this work useful for your research, please cite our paper:
```bibtex
@article{hu2025hapticllama,
title={HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning},
author={Hu, Guimin and Hershcovich, Daniel and Seifi, Hasti},
journal={arXiv preprint arXiv:2508.06475},
year={2025}
}
```