|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- GuiminHu/HapticCap |
|
|
- GuiminHu/VibRate |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- bleu |
|
|
- meteor |
|
|
- rouge |
|
|
base_model: |
|
|
- meta-llama/Llama-3.1-8B |
|
|
tags: |
|
|
- code |
|
|
--- |
|
|
|
|
|
# 📌 HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning |
|
|
|
|
|
arXiv: https://arxiv.org/pdf/2508.06475
|
|
|
|
|
Code: https://github.com/LeMei/HapticLLaMA
|
|
|
|
|
--- |
|
|
|
|
|
## 📖 Introduction |
|
|
**HapticLLaMA** is a multimodal sensory language model that interprets vibration signals and generates descriptions in a given sensory, emotional, or associative category.
|
|
HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF).
|
|
|
|
|
--- |
|
|
|
|
|
## 🧩 Tasks |
|
|
Given a vibration signal S and a target category c ∈ {sensory, emotional, associative}, where sensory refers to physical attributes (e.g., intensity of tapping), emotional denotes affective impressions (e.g., the mood of a scene), and associative indicates familiar real-world experiences (e.g., the buzzing of a bee, a heartbeat), the goal is to generate a caption corresponding to the specified category of haptic experience.
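As a minimal sketch of this task interface, each target category selects a prompt that conditions caption generation; the prompt strings match the inference examples later in this card, while the helper function itself is illustrative:

```python
# Category -> prompt mapping; the wording matches the inference examples
# in this card.  Each category targets a different facet of the haptic
# experience of the same vibration signal.
PROMPTS = {
    "sensory": "its sensory description is",          # physical attributes
    "emotional": "its emotional description is",      # affective impressions
    "associative": "its associative description is",  # familiar experiences
}

def build_prompt(category: str) -> str:
    if category not in PROMPTS:
        raise ValueError("category must be sensory, emotional, or associative")
    return PROMPTS[category]
```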
|
|
|
|
|
--- |
|
|
|
|
|
## 📂 Training |
|
|
HapticLLaMA training consists of (1) supervised fine-tuning with LoRA adaptation and (2) subsequent fine-tuning based on human feedback on the generated captions.
|
|
|
|
|
<img width="925" height="557" alt="image" src="https://github.com/user-attachments/assets/28a0aa75-d011-4870-b9ec-b9b3607eb8d8" /> |
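The LoRA adaptation used in stage (1) can be sketched in isolation: the pretrained weight stays frozen and only a low-rank update is trained. This is an illustrative NumPy sketch of the standard LoRA formulation, not the actual training code; the shapes and the alpha/r scaling are the conventional ones.

```python
import numpy as np

# Illustrative shapes; real LLaMA projection matrices are much larger.
d_out, d_in, r, alpha = 16, 32, 4, 8

W = np.random.randn(d_out, d_in)     # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))             # trainable up-projection, initialized to zero

# Effective weight after adaptation: only A and B receive gradients.
W_adapted = W + (alpha / r) * (B @ A)

# With B initialized to zero, fine-tuning starts from the base model exactly.
assert np.allclose(W_adapted, W)
```

Because B starts at zero, the adapted model is identical to the base model before training, which stabilizes the beginning of fine-tuning.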
|
|
|
|
|
--- |
|
|
|
|
|
## 📂 Haptic Tokenizer |
|
|
- **Frequency-based Tokenizer**: |
|
|
|
|
|
<img width="361" height="211" alt="image" src="https://github.com/user-attachments/assets/ca848d0b-18d5-4ad5-89e4-268399aad801" /> |
|
|
|
|
|
The frequency-based tokenizer divides the frequency range into logarithmically spaced bins that correspond to just-noticeable differences in human frequency perception. Similarly, the amplitude range is segmented into normalized levels. The tokenizer then assigns a unique token (e.g., FREQ_3_AMP_2) to each frequency-amplitude pair, encoding the signal's spectral content into a form interpretable by LLMs.
|
|
```python |
|
|
import numpy as np
import librosa
|
|
|
|
|
def steps_binning(frequencies, amplitudes, freq_bins=10, amp_levels=5): |
|
|
|
|
|
    freq_max = np.max(frequencies)
    # Log-spaced bin edges with a 1.2 ratio, anchored at the maximum observed frequency.
    freq_min = freq_max / (1.2**(freq_bins-1))
    freq_edges = np.geomspace(freq_min, freq_max, num=freq_bins)
|
|
freq_labels = [f"FREQ_{i+1}" for i in range(freq_bins)] |
|
|
amp_min, amp_max = np.min(amplitudes), np.max(amplitudes) |
|
|
if amp_min == amp_max: |
|
|
        # Degenerate case: constant amplitude, fall back to uniform levels.
|
|
amplitudes = np.zeros_like(frequencies) |
|
|
amp_edges = np.linspace(0, 1, amp_levels + 1) |
|
|
else: |
|
|
amplitudes = (amplitudes - amp_min) / (amp_max - amp_min) |
|
|
amp_min = amp_max / (1.2**(amp_levels-1)) |
|
|
amp_edges = np.geomspace(amp_min, amp_max, num=amp_levels) |
|
|
|
|
|
amp_labels = [f"AMP_{i+1}" for i in range(amp_levels)] |
|
|
|
|
|
tokens = [] |
|
|
for f, a in zip(frequencies, amplitudes): |
|
|
        freq_bin = np.digitize(f, freq_edges) - 1
        freq_bin = min(max(freq_bin, 0), freq_bins - 1)  # clamp to valid bin range
|
|
freq_token = freq_labels[freq_bin] |
|
|
|
|
|
        amp_bin = np.digitize(a, amp_edges) - 1
        amp_bin = min(max(amp_bin, 0), amp_levels - 1)  # clamp to valid level range
|
|
amp_token = amp_labels[amp_bin] |
|
|
|
|
|
tokens.append(f"{freq_token}_{amp_token}") |
|
|
return tokens |
|
|
|
|
|
### Load a .wav file and tokenize it (example parameters; adjust to your data)
wav_file = r'./F211_loop.wav'
n_fft, hop_length = 2048, 512
freq_bins, amp_levels = 10, 5
y, sr = librosa.load(wav_file, sr=None)
|
|
|
|
|
D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length) |
|
|
frequencies = librosa.fft_frequencies(sr=sr, n_fft=n_fft) |
|
|
magnitudes = np.abs(D) |
|
|
magnitudes = magnitudes / np.max(magnitudes) |
|
|
frame_idx = 10  # pick one STFT frame to tokenize
|
|
amplitudes = magnitudes[:, frame_idx] |
|
|
mask = frequencies < 500  # keep the haptic frequency range (< 500 Hz)
|
|
frequencies_filtered = frequencies[mask] |
|
|
amplitudes_filtered = amplitudes[mask] |
|
|
### Haptic tokens from the frequency-based haptic tokenizer
|
|
tokens = steps_binning(frequencies_filtered, amplitudes_filtered, freq_bins=freq_bins, amp_levels=amp_levels)
|
|
|
|
|
``` |
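Since each token pairs one frequency bin with one amplitude level, the haptic vocabulary added to the LLM is the Cartesian product of the two label sets. A small sketch using the default 10 bins and 5 levels from the code above:

```python
freq_bins, amp_levels = 10, 5

# Every frequency-amplitude pair gets its own token, e.g. FREQ_3_AMP_2.
vocab = [f"FREQ_{i+1}_AMP_{j+1}"
         for i in range(freq_bins) for j in range(amp_levels)]

assert len(vocab) == freq_bins * amp_levels  # 50 haptic tokens in total
assert "FREQ_3_AMP_2" in vocab
```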
|
|
--- |
|
|
- **EnCodec-based Tokenizer**: |
|
|
|
|
|
<img width="317" height="172" alt="image" src="https://github.com/user-attachments/assets/35e50d2e-c21f-4fc1-8953-74305a752ee0" /> |
|
|
|
|
|
EnCodec is a neural audio codec that compresses audio using deep learning (Défossez et al., 2023). It consists of three main components: (1) an encoder that transforms raw audio into a lower-dimensional latent representation, (2) a quantizer that discretizes the latent features via residual vector quantization, and (3) a decoder that reconstructs the waveform from the quantized codes. The EnCodec-based tokenizer extracts the codes produced by the residual vector quantizer in this audio compression architecture.
|
|
|
|
|
```python |
|
|
import torch
from datasets import Dataset, Audio
from transformers import AutoProcessor, EncodecModel
|
|
|
|
|
encodec_model = EncodecModel.from_pretrained("facebook/encodec_24khz") |
|
|
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz") |
|
|
|
|
|
### EnCodec-based Tokenizer |
|
|
def encodec_token(wav_file): |
|
|
data_dict = {"audio": [wav_file]} |
|
|
data_dataset = Dataset.from_dict(data_dict).cast_column("audio", Audio()) |
|
|
audio_sample = data_dataset[-1]["audio"]["array"] |
|
|
inputs = processor(raw_audio=audio_sample, sampling_rate=24000, return_tensors="pt") |
|
|
with torch.no_grad(): |
|
|
encoded_frames = encodec_model.encode(inputs["input_values"], inputs["padding_mask"]) |
|
|
    tokens = encoded_frames.audio_codes[0][0]  # codes of shape (num_quantizers, seq_len)
    tokens_list = [str(token) for token in tokens[0].tolist()]  # keep the first codebook
|
|
|
|
|
return tokens_list |
|
|
``` |
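The residual vector quantization that produces these codes can be illustrated in a few lines. This is a toy NumPy sketch of the RVQ principle only, not EnCodec's actual quantizer; the codebooks here would be learned in practice.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Each stage quantizes the residual left by the previous stage and
    emits one code index per codebook."""
    codes, residual = [], x.astype(float).copy()
    for cb in codebooks:  # cb has shape (num_entries, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction sums the selected entry from each stage's codebook.
    return sum(cb[i] for cb, i in zip(codebooks, codes))
```

Each additional codebook refines the residual left by the previous stages, which is how deeper quantizer stacks trade bitrate for reconstruction fidelity.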
|
|
--- |
|
|
## 📂 Inference |
|
|
|
|
|
Given a haptic signal, we prompt HapticLLaMA to generate captions from sensory, emotional, and associative perspectives. |
|
|
|
|
|
<img width="448" height="329" alt="image" src="https://github.com/user-attachments/assets/2ea17083-5da3-47f2-9781-7f17912d08cc" /> |
|
|
|
|
|
```python |
|
|
import os
import torch
from torch import nn
import librosa
from transformers import AutoTokenizer
|
|
|
|
|
# Load HapticLLaMA (Model and args come from the repository code)
|
|
def load_model(stage, device, mode, model_file_url): |
|
|
if os.path.exists(model_file_url): |
|
|
model = Model(args, mode=mode) |
|
|
        lora_state_dict = torch.load(model_file_url, map_location=device)
        model.load_state_dict(lora_state_dict, strict=False)
|
|
model.to(device) |
|
|
else: |
|
|
        print('invalid model path!')
|
|
model = None |
|
|
return model |
|
|
|
|
|
### Load the pretrained haptic tokenizers
|
|
|
|
|
frequency_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_steps_binning.pt/") |
|
|
encodec_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_encodec.pt/") |
|
|
|
|
|
# Format the haptic tokens and prompt into model inputs for inference
|
|
def tokenizer_haptic(haptic, prompt, mode): |
|
|
|
|
|
def formalize_input(haptic_tokens, tokenizer, prompt): |
|
|
tokenizer.pad_token = tokenizer.eos_token |
|
|
|
|
|
inputs = tokenizer(haptic_tokens, padding=True, truncation=True, return_tensors="pt") |
|
|
input_ids = inputs.input_ids |
|
|
input_atts = inputs.attention_mask |
|
|
|
|
|
prompt_enc = tokenizer(prompt, padding=True, truncation=True, return_tensors="pt") |
|
|
prompt_ids = prompt_enc.input_ids |
|
|
prompt_atts = prompt_enc.attention_mask |
|
|
|
|
|
prompt_ids = torch.cat((input_ids,prompt_ids),dim=1) |
|
|
prompt_atts = torch.cat((input_atts,prompt_atts),dim=1) |
|
|
|
|
|
|
|
|
return input_ids,input_atts, prompt_ids, prompt_atts |
|
|
|
|
|
###Frequency-based token formalization |
|
|
if mode == 'frequency': |
|
|
        freq_haptic_tokens = frequency_tokenizer(haptic, mode='frequency')
|
|
freq_haptic_tokens = [' '.join(freq_haptic_tokens)] |
|
|
freq_input_ids,freq_input_atts, freq_prompt_ids, freq_prompt_atts = formalize_input(freq_haptic_tokens, frequency_tokenizer, prompt=prompt) |
|
|
return freq_input_ids, freq_input_atts, freq_prompt_ids, freq_prompt_atts |
|
|
elif mode == 'encodec': |
|
|
###Encodec-based token formalization |
|
|
        encodec_haptic_tokens = encodec_token(haptic)  # encodec_token takes only the wav file
|
|
encodec_haptic_tokens = [' '.join(encodec_haptic_tokens)] |
|
|
        encodec_input_ids, encodec_input_atts, encodec_prompt_ids, encodec_prompt_atts = formalize_input(encodec_haptic_tokens, encodec_tokenizer, prompt=prompt)
        return encodec_input_ids, encodec_input_atts, encodec_prompt_ids, encodec_prompt_atts
|
|
|
|
|
``` |
|
|
Inference for a single sample:
|
|
|
|
|
```python |
|
|
haptic_signal = r'./F211_loop.wav' |
|
|
sensory_prompt = 'its sensory description is' |
|
|
##for emotional and associative |
|
|
##emotional_prompt = 'its emotional description is' |
|
|
##associative_prompt = 'its associative description is' |
|
|
input_ids, input_atts, prompt_ids, prompt_atts = tokenizer_haptic(haptic_signal, sensory_prompt, mode='encodec') |
|
|
# encodec_model_file_url is the path to your fine-tuned HapticLLaMA checkpoint
hapticllama = load_model(stage=1, device='cuda', mode='encodec', model_file_url=encodec_model_file_url)
caption = hapticllama.generate(inputs=prompt_ids, input_atts=prompt_atts)
|
|
print(caption) |
|
|
``` |
|
|
--- |
|
|
|
|
|
## 🚀 Citation |
|
|
If you find this work useful for your research, please cite our paper:
|
|
|
|
|
```bibtex |
|
|
@article{hu2025hapticllama, |
|
|
title={HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning}, |
|
|
author={Hu, Guimin and Hershcovich, Daniel and Seifi, Hasti}, |
|
|
journal={arXiv preprint arXiv:2508.06475}, |
|
|
year={2025} |
|
|
} |
|
|
``` |