base_model:
- meta-llama/Llama-3.1-8B
tags:
- code
---

# 📌 HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning

arXiv: https://arxiv.org/pdf/2508.06475

---

## 📖 Introduction

**HapticLLaMA** is a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF).

---

## 🧩 Tasks

- Given a vibration signal S and a target category c ∈ {sensory, emotional, associative}, where sensory refers to physical attributes (e.g., intensity of tapping), emotional denotes affective impressions (e.g., the mood of a scene), and associative indicates real-world familiar experiences (e.g., buzzing of a bee, a heartbeat), the goal is to generate a caption corresponding to the specified category of haptic experience.
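In practice, the category selects the natural-language prompt appended after the haptic tokens. A minimal sketch (prompt wording taken from the inference example later in this card; the helper name is ours):

```python
# Each target category maps to the prompt appended after the haptic tokens.
CATEGORY_PROMPTS = {
    "sensory": "its sensory description is",
    "emotional": "its emotional description is",
    "associative": "its associative description is",
}

def build_prompt(category: str) -> str:
    # Reject categories outside the three supported ones.
    if category not in CATEGORY_PROMPTS:
        raise ValueError(f"unknown category: {category}")
    return CATEGORY_PROMPTS[category]
```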

---

## 📂 Training

HapticLLaMA training consists of (1) supervised fine-tuning with LoRA adaptation and (2) subsequent fine-tuning based on human feedback on the generated captions.

<img width="925" height="557" alt="image" src="https://github.com/user-attachments/assets/28a0aa75-d011-4870-b9ec-b9b3607eb8d8" />
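The LoRA idea behind stage (1) can be sketched numerically (a toy illustration with made-up dimensions, not the model's actual configuration): the frozen weight W is augmented by a trainable low-rank product B·A, so only r·(d_in + d_out) parameters are updated per adapted layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2  # toy dimensions; real layers are much larger

W = rng.normal(size=(d_out, d_in))  # frozen base weight
A = rng.normal(size=(r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))            # trainable up-projection, zero-initialized

# With B initialized to zero, the adapted layer starts identical to the base.
W_adapted = W + B @ A
assert np.allclose(W_adapted, W)

# Trainable parameters per adapted layer: r * (d_in + d_out), not d_in * d_out.
assert A.size + B.size == r * (d_in + d_out)
```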

## 📂 Models

- **Frequency-based Model**:

- **EnCodec-based Model**:

---

## 📂 Haptic Tokenizer

- **Frequency-based Tokenizer**:

<img width="361" height="211" alt="image" src="https://github.com/user-attachments/assets/ca848d0b-18d5-4ad5-89e4-268399aad801" />

The frequency-based tokenizer divides the frequency range into logarithmically spaced bins that correspond to just-noticeable differences in human frequency perception. Similarly, the amplitude range is segmented into normalized levels. The tokenizer then assigns a unique token (e.g., FREQ_3_AMP_2) to each frequency-amplitude pair, encoding the signal's spectral content into a form interpretable by LLMs.

```python
import librosa
import numpy as np

def steps_binning(frequencies, amplitudes, freq_bins=10, amp_levels=5):
    # Frequency bins: logarithmically spaced edges with a constant ratio
    # of 1.2, anchored at the maximum observed frequency.
    freq_max = np.max(frequencies)
    freq_min = freq_max / (1.2 ** (freq_bins - 1))
    freq_edges = np.geomspace(freq_min, freq_max, num=freq_bins)
    freq_labels = [f"FREQ_{i+1}" for i in range(freq_bins)]

    amp_min, amp_max = np.min(amplitudes), np.max(amplitudes)
    if amp_min == amp_max:
        # Constant amplitude: map everything to the lowest level.
        amplitudes = np.zeros_like(frequencies)
        amp_edges = np.linspace(0, 1, amp_levels + 1)
    else:
        # Normalize to [0, 1], then build log-spaced level edges
        # on the normalized range.
        amplitudes = (amplitudes - amp_min) / (amp_max - amp_min)
        amp_edges = np.geomspace(1.0 / (1.2 ** (amp_levels - 1)), 1.0,
                                 num=amp_levels)

    amp_labels = [f"AMP_{i+1}" for i in range(amp_levels)]

    tokens = []
    for f, a in zip(frequencies, amplitudes):
        freq_bin = int(np.clip(np.digitize(f, freq_edges) - 1, 0, freq_bins - 1))
        freq_token = freq_labels[freq_bin]

        amp_bin = int(np.clip(np.digitize(a, amp_edges) - 1, 0, amp_levels - 1))
        amp_token = amp_labels[amp_bin]

        tokens.append(f"{freq_token}_{amp_token}")
    return tokens

# Example parameters (adjust to your signal)
wav_file = r'./F211_loop.wav'
n_fft, hop_length = 2048, 512

# Load the .wav file and compute its spectrum
y, sr = librosa.load(wav_file, sr=None)

D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
frequencies = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
magnitudes = np.abs(D)
magnitudes = magnitudes / np.max(magnitudes)
frame_idx = 10
amplitudes = magnitudes[:, frame_idx]

# Keep only the haptically relevant band below 500 Hz
mask = frequencies < 500
frequencies_filtered = frequencies[mask]
amplitudes_filtered = amplitudes[mask]

# Haptic tokens from the frequency-based haptic tokenizer
tokens = steps_binning(frequencies_filtered, amplitudes_filtered,
                       freq_bins=10, amp_levels=5)
```
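The logarithmic spacing can be checked in isolation: with the default ten bins, the 500 Hz band limit, and the 1.2 ratio from the code above, the bin edges grow geometrically, so low frequencies get finer resolution than high ones:

```python
import numpy as np

freq_bins = 10
freq_max = 500.0  # band limit used by the tokenizer
freq_min = freq_max / (1.2 ** (freq_bins - 1))
edges = np.geomspace(freq_min, freq_max, num=freq_bins)

# Adjacent edges differ by a constant ratio of 1.2.
ratios = edges[1:] / edges[:-1]
assert np.allclose(ratios, 1.2)
print(np.round(edges, 1))
```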

---

- **EnCodec-based Tokenizer**:

<img width="317" height="172" alt="image" src="https://github.com/user-attachments/assets/35e50d2e-c21f-4fc1-8953-74305a752ee0" />

EnCodec is a neural audio codec that compresses audio using deep learning (Défossez et al., 2023). It consists of three main components: (1) an encoder that transforms raw audio into a lower-dimensional latent representation, (2) a quantizer that discretizes the latent features via residual vector quantization (RVQ), and (3) a decoder that reconstructs the waveform from the quantized codes. The EnCodec-based tokenizer extracts the codes produced by the residual vector quantizer in this audio compression architecture.

```python
import torch
from datasets import Audio, Dataset
from transformers import AutoProcessor, EncodecModel

encodec_model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# EnCodec-based tokenizer: extract the RVQ codes of the first codebook
def encodec_token(wav_file):
    data_dict = {"audio": [wav_file]}
    data_dataset = Dataset.from_dict(data_dict).cast_column("audio", Audio())
    audio_sample = data_dataset[-1]["audio"]["array"]
    inputs = processor(raw_audio=audio_sample, sampling_rate=24000, return_tensors="pt")
    with torch.no_grad():
        encoded_frames = encodec_model.encode(inputs["input_values"], inputs["padding_mask"])
    tokens = encoded_frames.audio_codes[0][0]
    tokens_list = [str(token) for token in tokens[0].tolist()]
    return tokens_list
```
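Residual vector quantization itself can be sketched with a toy random codebook (purely illustrative; these are not EnCodec's trained codebooks): each stage quantizes the residual left by the previous stage, so the codes refine the approximation step by step.

```python
import numpy as np

rng = np.random.default_rng(0)
num_stages, codebook_size, dim = 3, 8, 4
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_stages)]

def rvq_encode(x, codebooks):
    # Each stage picks the code nearest to the current residual,
    # then subtracts it before passing the residual on.
    codes, residual = [], x.copy()
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

x = rng.normal(size=dim)
codes, residual = rvq_encode(x, codebooks)
assert len(codes) == num_stages

# Summing the selected codes reconstructs x up to the final residual.
recon = sum(cb[i] for cb, i in zip(codebooks, codes))
assert np.allclose(recon + residual, x)
```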

---

## 📂 Inference

Given a haptic signal, we prompt HapticLLaMA to generate captions from sensory, emotional, and associative perspectives.

<img width="448" height="329" alt="image" src="https://github.com/user-attachments/assets/2ea17083-5da3-47f2-9781-7f17912d08cc" />

```python
import os

import torch
from transformers import AutoTokenizer

# Load HapticLLaMA from a LoRA checkpoint.
# `Model` and `args` come from the accompanying training code (not shown here).
def load_model(stage, device, mode, model_file_url):
    if os.path.exists(model_file_url):
        model = Model(args, mode=mode)
        lora_state_dict = torch.load(model_file_url)
        model.load_state_dict(lora_state_dict, strict=False)
        model.to(device)
    else:
        print('invalid model url!')
        model = None
    return model

# Load the pretrained haptic tokenizers
frequency_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_steps_binning.pt/")
encodec_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_encodec.pt/")

# Formalize the input for inference
def tokenizer_haptic(haptic, prompt, mode):

    def formalize_input(haptic_tokens, tokenizer, prompt):
        tokenizer.pad_token = tokenizer.eos_token

        inputs = tokenizer(haptic_tokens, padding=True, truncation=True, return_tensors="pt")
        input_ids = inputs.input_ids
        input_atts = inputs.attention_mask

        prompt_enc = tokenizer(prompt, padding=True, truncation=True, return_tensors="pt")
        prompt_ids = prompt_enc.input_ids
        prompt_atts = prompt_enc.attention_mask

        # Append the prompt after the haptic tokens
        prompt_ids = torch.cat((input_ids, prompt_ids), dim=1)
        prompt_atts = torch.cat((input_atts, prompt_atts), dim=1)

        return input_ids, input_atts, prompt_ids, prompt_atts

    if mode == 'frequency':
        # Frequency-based token formalization
        freq_haptic_tokens = frequency_tokenizer(haptic, mode='frequency')
        freq_haptic_tokens = [' '.join(freq_haptic_tokens)]
        return formalize_input(freq_haptic_tokens, frequency_tokenizer, prompt=prompt)
    elif mode == 'encodec':
        # EnCodec-based token formalization
        encodec_haptic_tokens = encodec_token(haptic)
        encodec_haptic_tokens = [' '.join(encodec_haptic_tokens)]
        return formalize_input(encodec_haptic_tokens, encodec_tokenizer, prompt=prompt)
```
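The key step in `formalize_input` is the concatenation: the prompt's token ids are appended after the haptic token ids along the sequence dimension. With toy ids (not real vocabulary entries) it looks like this:

```python
import torch

haptic_ids = torch.tensor([[101, 7, 8, 9]])  # toy haptic token ids
prompt_ids = torch.tensor([[55, 56, 57]])    # toy prompt token ids

# Sequence dimension is dim=1, so the prompt follows the haptic tokens.
combined = torch.cat((haptic_ids, prompt_ids), dim=1)
assert combined.tolist() == [[101, 7, 8, 9, 55, 56, 57]]
```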

Inference for one sample:
```python
haptic_signal = r'./F211_loop.wav'
sensory_prompt = 'its sensory description is'
# For emotional and associative captions:
# emotional_prompt = 'its emotional description is'
# associative_prompt = 'its associative description is'
input_ids, input_atts, prompt_ids, prompt_atts = tokenizer_haptic(haptic_signal, sensory_prompt, mode='encodec')
# encodec_model_file_url: path to the fine-tuned HapticLLaMA checkpoint
hapticllama = load_model(stage=1, device='cuda', mode='encodec', model_file_url=encodec_model_file_url)
caption = hapticllama.generate(inputs=prompt_ids, input_atts=prompt_atts)
print(caption)
```

---

## 🚀 Citation

If you find this model useful for your research, please cite our paper:

```bibtex
@article{hu2025hapticllama,
  title={HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning},
  author={Hu, Guimin and Hershcovich, Daniel and Seifi, Hasti},
  journal={arXiv preprint arXiv:2508.06475},
  year={2025}
}
```